With the rapid evolution of computer technology has come the 
need to configure cheap computer systems with large processing power 
and high reliability. One method of achieving these goals has been 
to exploit the parallelism of multiprocessor systems. In recent years, 
an increasing number of multiprocessors have been designed and/or built, 
such as C.mmp at Carnegie-Mellon University, the TDC-316 multiprocessor 
at the Tata Institute of Fundamental Research, Bombay, and the BBN 
Pluribus Interface Message Processor for the ARPA Network. This acti- 
vity has given a spurt to work on computer modelling to analyse the 
performance of such systems at the level of the processor-memory inter- 
face. 

Work on performance evaluation, however, still lags far behind 
the advances in multiprocessor technology. The performance of a multi-
processor system is crucially dependent on the interconnection mechanism
used for communication between the functional units. Hence the prime
effort of ongoing research in this area is to devise better and more 
efficient interconnection schemes. The system designer must then have 
adequate tools available to enable him to evaluate and compare the per- 
formances of multiprocessors which use these various schemes. It is the 
problem of devising such evaluation techniques that we address ourselves 
to in this thesis. 

We start the thesis by describing some of the important inter- 
connection schemes being used in multiprocessors. These are the crossbar 
switch, time-shared bus, multiport memory/multibus, and a hybrid inter-
connection scheme used in the Pluribus multiprocessor built by BBN for
the ARPA network. Salient characteristics as well as the advantages and 
disadvantages of these schemes are discussed. 

We next give a summary of the existing work on analytic models for
the crossbar switch multiprocessors. Most of the past research on this 
topic has assumed that the memory references of each processor are uni- 
formly distributed among all the memory modules. Although this assump- 
tion considerably simplifies the analysis, it is not realistic, since 
programs generally exhibit the property of locality of references. 

The first new result in this thesis is the development of a model 
for crossbar switch multiprocessors with local referencing, which 
reflects more closely the behavior of real systems. This model is 
analysed using both discrete and continuous Markov chain techniques, and 
expressions are derived for the multiprocessor performance. New expressions
are also obtained for the performance in the traditional uniform
reference model and are compared with other expressions available in
the literature. Results of a simulation study are presented to demons- 
trate the accuracy of the expressions for both models. 

Almost all the work to date on computer modelling for analysing 
the performance of multiprocessor systems has been limited to the study 
of systems using a crossbar switch as the interconnection medium. As 
mentioned earlier, the tools of analytic modelling need to be improved 
to keep pace with the innovative development of new interconnection 
schemes. One of the main contributions of this thesis is the construc- 
tion of analytic models for multiprocessors using the time-shared bus 
and the hybrid Pluribus scheme as the interconnection structures. 

A discrete Markov chain model for time-shared bus multiprocessors 
is described. An example is given to explain the detailed analysis 
technique and simulation results are presented to verify the results 
of the analysis. 

Next, a model for evaluating the performance of the Pluribus 
multiprocessor is described. The Pluribus system is a hybrid contain- 
ing a crossbar switch and a number of time-shared buses. The analytic 
model described here breaks the system into its crossbar switch and 
time-shared bus components, simultaneously taking into account the 
complex interaction between these components. The crossbar switch is 
then analysed in terms of an existing model while the time-shared bus 
component is analysed using the model developed earlier. These results 
are synthesized to give the performance of the whole system. Graphical 
results are presented to show the effect of the various parameters of 
the system on its performance. Simulation results presented validate
the model. 

Finally, some suggestions are made for further work in this area.



CHAPTER 1 


INTRODUCTION 

With the rapid evolution of computer technology has come the
need to construct computer systems which will solve larger problems 
in less time with higher reliability. Parallel processing represents 
one of the more effective ways of achieving these goals. The initial
development of systems incorporating parallelism was almost entirely 
motivated by considerations of reliability. Thus redundancy was pro- 
vided at various levels of the system to cater for catastrophic failure 
situations. 

It gradually came to be realized, however, that redundant components
could actually be put to use to improve the performance of the system.
If some of the resources fail, dynamic re-allocation of the remaining
resources then results in graceful degradation or "fail-soft" operation.

Another advantage of several parallel systems is their flexibility.
Flexibility, in the words of Searle and Freberg [Sea 75], "is a measure
of the ease with which a system configuration can be altered." For a
system to be truly flexible, both its hardware and software should be 
capable of easy alteration. A better understanding of operating systems 
for large parallel systems has emerged recently, thus allowing them to be 
made more flexible. In fact, availability of better software is one of 
the factors responsible for the increasing popularity of such systems. 

However, the greatest potential benefit of parallel systems lies in 
their performance capabilities. Electronic technology appears to be 
approaching limits imposed by electrical propagation delays. Parallel
processing offers an attractive means of overcoming this problem. Plummeting
hardware costs have also improved the cost/performance viability
of these systems. The future is thus certainly going to witness an
increasing exploitation of concurrency and parallelism at all possible
levels. 

1.1 Types of Parallelism 

The term "parallel processing" encompasses in its scope a wide 
variety of computer systems. One classification has been given by 
Flynn [Fly 66, Fly 72b], who divides computer systems into four categories:

(a) Single Instruction Single Data (SISD), 

(b) Single Instruction Multiple Data (SIMD), 

(c) Multiple Instruction Single Data (MISD), and

(d) Multiple Instruction Multiple Data (MIMD). 

SISD covers the usual uniprocessor computers. Associative processors, 
processing ensembles, and array processors, such as the ILLIAC IV, fall
in the SIMD category. Pipeline processors may be considered to belong to
any of the SIMD, MISD, or MIMD architectures. Multiprocessors and
multicomputer systems belong to the MIMD class.

1.2 Multiprocessors

Multiprocessors, the subject of this thesis, as distinct from 
multiple-computer systems, are not easy to define. The difference between 
a multiple-computer system and a multiprocessor is in the extent and degree 
of sharing: whereas the former consists of several separate and discrete
computers, the latter is a single computer with multiple processing units. 





Enslow [Ens 74, Ens 77] defines a multiprocessor as a system with
the following characteristics: 

(a) It contains two or more processing units of approximately 
comparable capabilities; 

(b) All processors share access to a common memory (although
some private memory may be allowed);

(c) All processors share access to input/output channels,
control units, and devices;

(d) There is a single integrated operating system in overall
control of all hardware and software; and

(e) There must be intimate interaction possible at both hard- 
ware and software operating levels. 

Figure 1 depicts the basic structure of a multiprocessor system.
Thus a multiprocessor has capabilities for the sharing of memory and
input/output devices by all processors; the input/output devices also
have complete access to memory. Hence the interconnection system has
to support three types of communication: processor-memory, processor-
I/O, and memory-I/O.

Although multiprocessors have all the three advantages of reliability, 
flexibility, and higher performance mentioned earlier, they pose a number 
of problems not encountered in single-processor systems. A multiprocessor 
system must have special facilities, both hardware and software, to resolve
contention for shared resources. The operating system is larger and more 
complex than for uniprocessors. To properly exploit the available para- 
llelism, tasks need to be divided into subtasks which can be executed in 
parallel. This makes scheduling considerably more complicated. 




FIGURE 1 : Structure of a Multiprocessor System

At the hardware level, proper mechanisms have to be provided for
communication between the various functional units. The interconnection 
system must have high bandwidth and must be reliable. A processor should 
have the capability of interrupting other processors. Efficient failure- 
detection is important for high reliability and both manual and automatic 
reconfiguration should be possible. 

Performance monitoring and evaluation of multiprocessors are not only 
more complex than for uniprocessors, they are also more important. For a
long time there had been an impression that multiprocessors were not capa- 
ble of high performance and that they were not cost-effective. With better 
evaluation techniques, these mistaken notions are now being dispelled. 

Work on performance evaluation, however, still lags behind the advances 
in multiprocessor technology. The performance of a multiprocessor system 
is crucially dependent on the interconnection mechanism used for communi-
cation between the functional units. Hence the prime effort of ongoing
research in this area is to devise better and more efficient interconnection 
schemes. The system designer must then have adequate tools available to 
enable him to evaluate and compare the performances of multiprocessors 
which use these various schemes. It is the problem of devising such evalua-
tion techniques that we address ourselves to in this thesis.

1.3 Overview of the Thesis

We start in Chapter 2 by describing some of the important inter- 
connection schemes used in multiprocessors, namely, the crossbar switch, 
time-shared bus, multiport memory/multibus, and a hybrid interconnection 
scheme used in the Pluribus multiprocessor built by Bolt, Beranek, and
Newman, Inc. (BBN), Cambridge, Mass., for the ARPA Network. Salient
characteristics as well as the advantages and disadvantages of these
schemes are discussed.

We begin Chapter 3 with a summary of existing work on analytic
models for crossbar switch multiprocessors. Most of the past research
on this topic has assumed that the memory references of each processor
are uniformly distributed among all the memory modules. Although this
assumption considerably simplifies the analysis, it is not very realistic, 
since programs generally exhibit the property of locality of references. 

In Chapter 3, we develop a model for crossbar switch multiprocessors 
with local referencing, which reflects more closely the behavior of real 
systems. This model is analysed using both discrete and continuous Markov 
chain techniques, and expressions are derived for the multiprocessor per-
formance. New expressions are also obtained for the performance in the
traditional uniform reference model and are compared with other expressions 
available in the literature. Results of a simulation study are given to 
show the accuracy of the expressions for both models. 

Almost all the work to date on computer modelling to analyse the 
performance of multiprocessor systems has been limited to the study of 
systems using a crossbar switch as the interconnection medium. As men-
tioned in the previous section, the tools of analytic modelling need to
be improved to keep pace with the innovative development of new inter-
connection schemes. Keeping this aim in view, one of the main contri-
butions of this thesis is the construction of analytic models for multi-
processors using the time-shared bus and the hybrid Pluribus scheme
as the interconnection structures.

Chapter 4 describes a discrete Markov chain model for time-shared bus
multiprocessors. An example is given to explain the detailed analysis 
technique and simulation results are presented to verify the results of 
the analysis. 

A model for evaluating the performance of the Pluribus multiprocessor
is described in Chapter 5. The Pluribus is a hybrid system containing a
crossbar switch and a number of time-shared buses. The analytic model
described there decomposes the system into its crossbar switch and time-
shared bus components, simultaneously taking into account the complex
interaction between these components. The crossbar switch is then analy-
sed in terms of an existing model while the model of Chapter 4 is used to
analyse the time-shared bus component. These results are synthesized to 
give the performance of the whole system. Graphical results are presented 
to show the effect of the various parameters of the system on its perfor- 
mance. Simulation results are given to verify the validity of the model. 

Chapter 6 presents the conclusions and suggests the directions that 
future work in this area may take. 



CHAPTER 2 


MULTIPROCESSOR INTERCONNECTION STRUCTURES

Of paramount importance in a multiprocessor system is the communi-
cation mechanism and the mode of interconnection between its functional
units, namely, the processors, memory modules, and the input/output units.
Sharing of memory modules between multiple processors and I/O units results
in conflicts between units desiring to access the same memory module at the
same time. This phenomenon is called memory interference and it is the
primary cause of degradation in the multiprocessor performance. Thus the
main task of an analytic model for a multiprocessor is to estimate the
amount of memory interference in the system.

In the models considered in this thesis, the effect of input/output 
units will not be modelled explicitly. This is because, in most cases,
their effect on the overall performance of the system is insignificant
[Str 70]. For example, the memory traffic generated by four drums or 15
fixed-head disks transferring at full rate is comparable to the activity
of one processor [Bel 71].

Some of the important interconnection media used in multiprocessors 
are the crossbar switch, time-shared bus, multiport memory/multibus, and
a hybrid interconnection scheme used in the BBN Pluribus multiprocessor.
There are a number of good papers which focus on multiprocessor inter-
connections [And 75, Bae 76, Ens 74, Ens 77, Per 73, Sea 75, Swa 76].

In this chapter, we shall discuss the salient features as well as some 
of the advantages and disadvantages of these interconnection schemes. 

This discussion, however, constitutes only one way of interpreting these 
various schemes. There are other ways of viewing these structures as is 
obvious from the references cited above. Swan et al. [Swa 76], for instance,
regard these interconnection structures as mere variants of one fundamental
structure, namely, the crossbar switch.

2.1 Crossbar Switch

In the crossbar switch organisation, shown in Figure 2, any memory
module can be connected to any processor. A full-time connection is 
established between the two units for the complete duration of the trans- 
fer. In the absence of a conflict, multiple connections are possible at 
a time. Such an organization is characterized by high bandwidth; in fact, 
among all the interconnection schemes, the crossbar switch has the poten- 
tial for the highest total system transfer rate. 

Since all the circuitry for conflict resolution is incorporated in 
the switch itself, the control logic of the memory modules is very simple. 

If the switch is distributed, the system can be made modular as well as 
reliable, and additional processors and/or memory modules can be added 
without too much difficulty. 

However, the crossbar switch is extremely complicated and costly. 

It has been estimated that the cost of a switch for a large system is 
comparable to the cost of a few processors [Ens 74]. 

An important example of a multiprocessor system using a crossbar 
switch is C.mmp, the multiminiprocessor built at Carnegie-Mellon University
[Bel 71, Wul 72]. The TDC-316 multiprocessor under construction at the
Tata Institute of Fundamental Research, Bombay [Jos 76, Hay 76], also
employs a crossbar switch.



FIGURE 2 : Crossbar Switch Configuration.

FIGURE 3 : Time-Shared Bus Configuration.

2.2 Time-shared Bus

The time-shared bus system, shown in Figure 3, is one of the simplest
and cheapest interconnection schemes. Here there are no continuous connec- 
tions between functional units. All the units are connected in parallel 
to the bus which allows communication between one pair of units at a time. 
Since there is only one transfer path for all transfers, the system band- 
width and efficiency are low and the reliability is poor. It is very easy 
to physically modify the system configuration by adding or removing func-
tional units. However, system expansion results in considerable degrada-
tion of the overall system performance.

For these reasons, the use of the time-shared bus is limited and it
is not generally used for high-performance multiprocessors. Examples of
systems using a time-shared bus are the PDP-11 and the Lockheed SUE. A
good description of the operation of a bus may be found in [Thu 72] and
in &an 71 ]. 

2.3 Multiport Memory/Multibus

The multiport memory/multibus system, shown in Figure 4, tries to
overcome some of the disadvantages of the single time-shared bus. It is
also less costly than the crossbar switch. For achieving high band- 
width, each processor may be assigned a dedicated bus, although this is 
poor from the point of view of reliability. If a bus bottleneck is 
present, expanding the number of buses increases the system throughput. 

However, it is essential, in this organization, for the memory 
modules to have a number of access ports which makes the control logic 
of the memories more complex. The cabling and connector costs are large 
and become more pronounced as the system expands. The maximum configura-
tion possible is also limited by the number of ports available on a memory
module.

FIGURE 4 : Multiport Memory/Multibus Configuration.

The UNIVAC 1108 and the IBM System/360 Model 67 are examples of such
systems. 

2.4 Pluribus Organization 

This interesting unconventional scheme has been used by BBN for their
Pluribus multiprocessor to be used as an Interface Message Processor for
the ARPA Network [Bar 75, Hea 73, Orn 75]. This is a hybrid between the
crossbar switch and the time-shared bus systems and is shown in Figure 5.

In this system, there are a number of processor buses, each contain- 
ing a few processors and a few local memories connected by a time-shared 
bus. The local memories on a processor bus are accessible only to the 
processors on that bus. There are also additional memory buses which con- 
tain only memory modules. These memories are global and are accessible to 
any processor via the crossbar switch. In the system built by BBN, there
are 7 processor buses each having 2 processors and 2 local memories, and 
2 memory buses each having 2 global memories. The crossbar switch is dis- 
tributed and is implemented by interconnecting units called bus couplers 
on various buses. 

The Pluribus system is cheaper than the pure crossbar switch scheme
and does not have the bandwidth limitations of the single bus scheme.
However, for maximum efficiency, the bandwidths of the various components 
of the system must be carefully matched. 

FIGURE 5 : PLURIBUS Configuration.

2.5 Definitions and Notations

A mathematical model of a system may be constructed at various levels
of abstraction [Bha 76]. The interconnection mechanism in a multiprocessor
forms the interface between the processors and the memory modules. This
interface has a significant effect on the multiprocessor performance; for
this reason, the models considered in this thesis are all at the level 
of the processor-memory interface. 

At this level, the interaction between the processors and memories 
must be very precisely defined. In general, processor behavior varies 
for different instructions. However, we shall not explicitly model these 
differences in instructions. We shall also make no distinction between 
the processing needed to decode an instruction and the processing corres- 
ponding to its execution. Instead we shall use the concept of a unit 
instruction, first proposed by Strecker [Str 70], which simply models 
the fetching of a word from memory followed by the processing of the word 
by a processor. 

A diagrammatic representation of a unit instruction is shown in
Figure 6(a). In this figure, t_p' represents the processing time of the
processor, t_c is the memory cycle time, t_a the memory access time and
t_w the memory rewrite time.

In most cases of interest to us, t_p' will generally be greater than
or equal to t_w. In these cases, to facilitate our discussion, a trans-
formation may be made on the diagram of Figure 6(a) to give Figure 6(b).
The memory now has an access time of t_c and zero rewrite time; the new
processing time is t_p = t_p' - t_w. This transformation introduces no change
in the sense that the performance of the system in either case is the
same. In both cases, the memory access begins at point 1, the memory
is ready to service a new request at point 2, and the processor execu-
tion is completed at point 3. (This transformation was first used by
Strecker [Str 70].) Thus, in all our models, without loss of generality,
we shall assume the memory access time to be equal to the cycle time,
and the rewrite time to be zero.
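
The equivalence of the two diagrams can be checked with a small calculation. The sketch below is illustrative only. It assumes that the cycle time is the sum of the access and rewrite times, and that the processor starts working as soon as the access completes, so that processing overlaps the rewrite. The function and argument names are hypothetical and do not come from the thesis.

```python
def original_points(t_a, t_w, t_p_orig):
    """Timing of a unit instruction as in Figure 6(a), measured from point 1.

    Assumed reading: the access takes t_a, after which the memory rewrites
    for t_w while the processor works for t_p_orig.
    """
    memory_ready = t_a + t_w          # point 2: memory can take a new request
    processor_done = t_a + t_p_orig   # point 3: processor finishes execution
    return memory_ready, processor_done


def transformed_points(t_a, t_w, t_p_orig):
    """Timing after the transformation of Figure 6(b)."""
    if t_p_orig < t_w:
        raise ValueError("transformation assumes processing time >= rewrite time")
    t_c = t_a + t_w             # the access time now equals the full cycle time
    t_p = t_p_orig - t_w        # new (shorter) processing time
    memory_ready = t_c          # zero rewrite time, so point 2 is at t_c
    processor_done = t_c + t_p  # point 3
    return memory_ready, processor_done
```

Under these assumptions both functions return the same pair of instants for any timing values with processing time at least equal to the rewrite time, which is the sense in which the transformation changes nothing.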

The performance measure used most often in this thesis is the Unit
Instruction Execution Rate (UER), which is the total number of unit ins-
tructions executed by the system in unit time (generally 1 μsec.). We
would like to emphasize, however, that other performance measures, such
as utilization factor, percentage idle time, etc., can all be reduced to
UER, and given one, all the others can be easily calculated, if necessary.
Thus, processor utilization = t_p x UER, memory utilization = t_c x UER,
etc.



CHAPTER 3 


MODELS FOR CROSSBAR SWITCH SYSTEMS

In this chapter, we shall be concerned with analytic models for 
multiprocessors using a crossbar switch. In Section 3.1, we give a
summary of past work in this area. All these existing models make the 
assumption that the memory references of each processor are uniformly 
distributed among the memory modules. Although this assumption consi- 
derably simplifies the analysis, it is not very realistic, since pro- 
grams generally exhibit the property of locality of references. We 
shall propose a model with local referencing, which reflects more closely 
the behavior of real systems. Section 3.2 lists in detail the assump-
tions made for this model. In Section 3.3, we use discrete Markov chain
techniques to analyse this model. Section 3.4 presents an alternative
analysis using continuous Markov chain processes. Simulation results
are given in Section 3.5. Finally, new expressions for the uniform
reference model are developed in Section 3.6. 

3.1 Review of Existing Models

In this section, we consider systems with p processors and m memory 
modules. We shall denote by t_p the average processing time of all pro-
cessors (which are assumed to be identical); all memory modules are assumed
to have equal constant cycle times t_c with access time t_a and rewrite
time t_w.

Skinner and Asher [Ski 69] were the first to use Markov chain
models to analyse multiprocessors. However, their study was limited
to a small number of processors and memory modules, and they found
it difficult to generalize their expressions for larger systems.
Strecker [Str 70], using some simplifying assumptions, was able to
give general approximate expressions. He considered three cases:
(a) t_p < t_w, (b) t_p = t_w, and (c) t_p > t_w. Of these, the third case is
important because many real systems fall in this category. Strecker
gives the following expression for the performance of a multiprocessor
system for case (c) (t_p > t_w):

    UER = (m/t_c)(1 - (1 - P_m/m)^p)

where P_m + (m/p)((t_p - t_w)/t_c)(1 - (1 - P_m/m)^p) - 1 = 0        (3.1)

In Chapter 5 we shall have occasion to use this equation. There
we shall assume t_w to be zero; since t_p is always positive, our system
will fall under case (c) and Equation (3.1) will be applicable to it.
It should, however, be remembered that Strecker's analysis is approximate,
not exact.
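
Equation (3.1) defines P_m only implicitly, so in practice it has to be solved numerically. The following sketch shows one possible way of doing this, by bisection on the interval [0, 1]. It is an illustration under the reading of Equation (3.1) given above, not Strecker's own procedure; the function name and the iteration count are hypothetical choices.

```python
def strecker_uer(p, m, t_p, t_w, t_c, iterations=100):
    """Solve Equation (3.1) for P_m by bisection and return (P_m, UER).

    Applies to t_p >= t_w, i.e. cases (b) and (c). The left-hand side of
    (3.1) is -1 at P_m = 0, non-negative at P_m = 1, and increasing in
    between, so a single root lies in [0, 1].
    """
    if t_p < t_w:
        raise ValueError("this sketch covers only t_p >= t_w")
    ratio = (t_p - t_w) / t_c

    def busy(P):
        # m(1 - (1 - P/m)^p): approximate number of busy memory modules
        return m * (1.0 - (1.0 - P / m) ** p)

    def residual(P):
        # Left-hand side of Equation (3.1)
        return P + (ratio / p) * busy(P) - 1.0

    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    P_m = 0.5 * (lo + hi)
    return P_m, busy(P_m) / t_c
```

When t_p = t_w the second term of (3.1) vanishes and the root is P_m = 1, so the expression reduces to UER = (m/t_c)(1 - (1 - 1/m)^p).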

Bhandarkar [Bha 75] used discrete Markov chain models to analyse
cases (b) and (c) above. He used this analysis to write a program to
compute exact values for the system performance. However, his program
is time-consuming for even moderately-sized systems. Bhandarkar
modified Strecker's expression for case (b) in the light of the
results available from his program. Bhandarkar and Fuller
analysed multiprocessors using a continuous-time Markov chain
model.

In this chapter we shall consider only those multiprocessor systems 
in which the memory is partitioned into modules by the higher order bits 
of the address. Memories interleaved by the low order address bits have 
been studied by Burnett and Coffman [Bur 70, Bur 73, Bur 75], also
jointly with Snowdon [Cof 71], and by Sastry and Kain [Sas 75]. It
should be noted that if we assume uniformly distributed memory references
by each processor, then the behavior of low-order interleaved memories
is no different from that of high-order interleaved memories. Baskett
and Smith [Bas 76] have given asymptotic results for low-order inter-
leaved memories with uniformly distributed references. Thus their physi-
cal model is the same as that of Bhandarkar, with the difference that they
have studied its asymptotic behavior. 

3.2 Assumptions for Local Reference Model

The following major assumptions characterize the local reference 
model developed in this chapter: 

Assumption 1 : The system has p processors and m memory modules. All 
processors and all memory modules are identical. This will be referred 
to as a p x m system. 

Assumption 2 : Instructions of the processors are modelled using the
concept of the unit instruction defined in Section 2.5.

Assumption 3 : All memory modules have equal constant cycle times and 
their operation is synchronized with no overlapping of read/write cycles. 
The access time of each module is equal to its cycle time, and the 
rewrite time is zero (see Figure 6(b)). The processing time of each
processor is assumed to be zero.

Assumption 4 : The processors and memories are connected by a crossbar
switch which permits every processor to have access to every memory
module. All memory modules are simultaneously accessible so that, under
no conflict, a maximum of min(p,m) words can be fetched simultaneously.
The switch is assumed to have zero delay. However, crossbar switches
with nonzero delay may be modelled by simply adding the delay to t_c, the
memory cycle time.

Assumption 5 : From each memory module only one word can be fetched at
a time. If two or more processors simultaneously make requests for the
same memory module, only one of these requests can be served in the next
memory cycle. The other processors are queued up at the module to be
served in subsequent cycles.

Assumption 6 : Consecutive addresses in memory are mapped into the same
module modulo the module size. Thus the high-order bits of an address
determine the module to which the address belongs.

Assumption 7 : Successive requests of a processor follow the pattern
described below. If the k-th request of a processor is for memory module
i, then its (k + 1)st request will be for module i with probability α,
and for module j (j ≠ i) with probability (1 - α)/(m - 1). Thus all memory
modules except module i are accessed with equal probability. Probability
α is a constant and is equal for all processors.

It should be noted that if α = 1/m then all memory modules are
accessed with equal probability, and this model reduces to the uniform
reference model analysed by Bhandarkar [Bha 75]. However, in general,
α will not equal 1/m, in which case we shall call this the local reference
model.
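
The reference pattern of Assumption 7 can be written down as an m x m matrix of transition probabilities between successively referenced modules. The sketch below builds that matrix and draws the next module for one processor. It is meant only as an illustration of the assumption; the function names are hypothetical, and modules are numbered from 0 for convenience.

```python
import random


def reference_matrix(m, alpha):
    """Row i holds the probabilities of the next request given the last was to module i."""
    if m < 2:
        raise ValueError("the local reference pattern needs at least two modules")
    other = (1.0 - alpha) / (m - 1)   # probability of each module j != i
    return [[alpha if i == j else other for j in range(m)] for i in range(m)]


def next_module(i, m, alpha, rng):
    """Sample the module of the (k+1)st request, given the k-th was to module i."""
    if rng.random() < alpha:
        return i                      # stay on the same module
    j = rng.randrange(m - 1)          # pick uniformly among the other m - 1 modules
    return j + 1 if j >= i else j     # skip over module i


rng = random.Random(1)
trace = [0]
for _ in range(20):
    trace.append(next_module(trace[-1], 4, 0.8, rng))
```

With alpha set to 1/m every entry of the matrix equals 1/m, which is the uniform reference case; with alpha close to 1 the sampled trace consists of long runs of references to a single module.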

If α is large compared to 1/m, the processor will tend to access
the same memory module repeatedly until it changes to a different module,
and the same behavior is repeated. It is our belief that this model is
more representative of real-life multiprocessor systems than the uniform
reference model. A multiprocessor system generally works in a multi-
programming environment in which each processor executes a more-or-less
independent task. Thus each processor would concentrate its attention
on blocks of consecutive addresses which, in our model, are mapped into
the same module. Thus the probability of consecutive references being
to the same module is quite high. Occasionally a task may be split across
more than one module; references may also be made to the executive which
may reside in a different module. But this happens relatively infrequently;
programs are also mostly sequential in nature and present-day programming
styles emphasize modular programs. Hence the parameter α, though not
equal to 1, will be quite close to it. It seems reasonable to assume
that most such environments will have α > 0.75. We shall show later
that the multiprocessor performance is more or less unaffected by the
value of α so long as α lies in this range. However, the performance of
systems with α > 0.75 is worse than that predicted by the uniform
reference model.

The performance measure used in this chapter is the Average Number of
Busy Memory Modules (ANBM). This is the average number of memory modules
that are busy during a memory cycle. Using this measure will permit us to
conveniently compare our results with other results available in the litera-
ture. Its relation with the Unit Instruction Execution Rate (UER) is given
by

    UER = ANBM / t_c.

3.3 Discrete Markov Chain Analysis 

In this section we shall analyse the local reference model, defined by 
the assumptions of the previous section, using discrete Markov chain 
techniques. An excellent description of Markov processes may be found in 
Kleinrock [Kle 75]. Bhandarkar [Bha 75] has developed a systematic approach 
to the use of the discrete Markov chain technique for analysing memory 
interference in multiprocessor systems. We shall use this technique in the 
analysis of this section, and again in Chapter 4 for analysing the model 
for the time-shared bus. 

The exact analysis of the Markov chain model is very complex, even 
for the uniform reference model. For this reason, Bhandarkar did not 
attempt to derive general expressions for the system performance with p 
and m as parameters. Instead, he wrote a program to compute the Average 
Number of Busy Memory Modules for any given $p \times m$ system. 

In this section, we shall derive such expressions for the local 
reference model with m as a parameter for small, constant values of p 
(such as 2 or 3), and correspondingly, with p as a parameter for small, 
constant values of m. Approximations of these expressions will then be 
generalized to hold for all values of p and m. We shall use the same 
approach with the uniform reference model in Section 3.6. It should 
be noted that the local reference model has $\alpha$ as an additional 
parameter, making the analysis more complicated. Our interest will, however, 
lie in systems for which $\alpha$ exceeds 0.75. 

At any given time, the state of a $p \times m$ system can be characterized 
by the lengths of the queues at each memory module. Following Bhandarkar's 
notation, this state is denoted by an m-tuple $(k_1, k_2, \ldots, k_m)$, 
where 

$$ \sum_{i=1}^{m} k_i = p, $$

and $0 \le k_i \le p$ for $1 \le i \le m$. Integer $k_i$ represents the 
number of processors waiting in the queue at module i (including the 
processor being served). Since all processors are identical, a number of 
these states are equivalent, such as states (2,2,1), (2,1,2) and (1,2,2). 

Every such equivalence class will be called a reduced state. In the 
notation of a reduced state, we shall generally omit all 0's, e.g., 
state (2,1,0,0) will be written simply as (2,1). For any given value 
of m, this notation is unambiguous. 

Let us consider a $2 \times m$ system, in which there are two processors 
and $m \ge 2$ memory modules. This system has only two reduced states, 
$s_1 = (2)$ and $s_2 = (1,1)$. Consider state $s_1 = (2)$. At the end of a 
memory cycle, the resultant partial state is (1) with one free processor 
to be reassigned. This may be assigned to the same memory module with 
probability $\alpha$ and to a different module with probability 
$(1 - \alpha)$. Thus transitions from state $s_1$ to states $s_1$ and $s_2$ 
will occur with probabilities $\alpha$ and $(1 - \alpha)$ respectively. 
Following a similar procedure for state $s_2$, the transition matrix T can 
be shown to be: 

$$ T = \begin{bmatrix} \alpha & 1 - \alpha \\[4pt] \dfrac{(1 - \alpha)(m\alpha + m - 2)}{(m - 1)^2} & \dfrac{m^2 + m(\alpha^2 - 3) + 3 - 2\alpha}{(m - 1)^2} \end{bmatrix} $$

By computing the steady-state probabilities for this Markov chain, 
it can be shown that the Average Number of Busy Memory Modules for a 
$2 \times m$ system is given by 

$$ \mathrm{ANBM} = \frac{m(2m + \alpha - 3)}{m(m + \alpha - 1) - 1} \tag{3.2} $$
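
The steady-state step for the two-state chain can be checked mechanically. The following is a minimal sketch, not the author's program: the function names are hypothetical, exact rational arithmetic is used, and $\alpha < 1$ is assumed so that the chain has a unique steady state.

```python
from fractions import Fraction


def anbm_2xm_chain(m, alpha):
    """ANBM of a 2 x m system from the two-state chain (states (2) and (1,1)).

    Assumes m >= 2 and 0 <= alpha < 1.
    """
    a = Fraction(alpha)
    # Probability of moving from state (1,1) to state (2).
    q = (1 - a) * (m * a + m - 2) / Fraction((m - 1) ** 2)
    # Balance equation: pi_1 * (1 - a) = pi_2 * q, with pi_1 + pi_2 = 1.
    pi_1 = q / (q + (1 - a))
    pi_2 = 1 - pi_1
    # One module is busy in state (2); two modules are busy in state (1,1).
    return pi_1 + 2 * pi_2


def anbm_2xm_closed_form(m, alpha):
    """Closed form of Equation (3.2)."""
    a = Fraction(alpha)
    return m * (2 * m + a - 3) / (m * (m + a - 1) - 1)
```

For any admissible pair (m, $\alpha$) the two functions should return the same fraction, and setting $\alpha = 1/m$ should reproduce the uniform reference value $2 - 1/m$.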

Let us now study a $p \times 2$ system having two memory modules and 
$p \ge 2$ processors. This system has $\lfloor p/2 \rfloor + 1$ states (see 
footnote 1): 

$$ (p),\ (p-1, 1),\ (p-2, 2),\ \ldots,\ (\lfloor (p+1)/2 \rfloor, \lfloor p/2 \rfloor). $$

For example, if p = 8, then the states are (8), (7,1), (6,2), (5,3), and 
(4,4); if p = 9, then the states are (9), (8,1), (7,2), (6,3), and (5,4). 

The transition matrix can now be obtained, and it can be shown that 
for a $p \times 2$ system: 

$$ \mathrm{ANBM} = \frac{2(p + \alpha - 1)}{p + 2\alpha - 1} \tag{3.3} $$

If we substitute $\alpha = 1$ in Equations (3.2) and (3.3), we get 
respectively: 

$$ \mathrm{ANBM} = 2 - \frac{2}{m + 1} \tag{3.4} $$

and 

$$ \mathrm{ANBM} = 2 - \frac{2}{p + 1} \tag{3.5} $$

Footnote: 

1 $\lfloor x \rfloor$ denotes the largest integer smaller than or equal to x. 

We shall show in Section 3.5 that when $\alpha$ lies in the range 0.75 
to 0.95, Equations (3.2) and (3.3) can be approximated, without any 
significant loss in accuracy, with Equations (3.4) and (3.5) respectively. 

Now consider a $3 \times m$ system with three processors and $m \ge 3$ 
memory modules. This system becomes exceedingly difficult to analyse for an 
arbitrary value of probability $\alpha$ using the method employed in the 
analysis of the $2 \times m$ and $p \times 2$ systems, because of the large 
number of states involved. However, if we assume $\alpha = 1$, the problem 
is more tractable, and it can be shown that for a $3 \times m$ system with 
$\alpha = 1$ 

$$ \mathrm{ANBM} = 3 - \frac{6}{m + 2} \tag{3.6} $$

That this is a good approximation to the actual value when $\alpha$ lies in 
the range 0.75 to 0.95 is borne out by the simulation results discussed 
in Section 3.5. 

A general expression suggested by the three Equations (3.4), (3.5), 
and (3.6) is that in a $p \times m$ system with $\alpha = 1$ the Average 
Number of Busy Memory Modules should be given by 

$$ \mathrm{ANBM} = p - \frac{p(p - 1)}{m + p - 1} = \frac{mp}{m + p - 1} \tag{3.7} $$

It should be noted that this equation is symmetric with respect to m 
and p. The nature of the actual values of ANBM and the accuracy of 
these approximations will be explored in Section 3.5. 
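
As a quick consistency check, the general expression (3.7) can be evaluated exactly and compared with the special cases (3.4), (3.5), and (3.6). This is only an illustrative sketch; the function name `anbm_local` is an assumption of this example.

```python
from fractions import Fraction


def anbm_local(p, m):
    """Equation (3.7): approximate ANBM of a p x m local reference system."""
    return Fraction(m * p, m + p - 1)


# Special cases quoted in the text, for a few system sizes.
for size in range(2, 12):
    assert anbm_local(2, size) == 2 - Fraction(2, size + 1)   # (3.4), 2 x m
    assert anbm_local(size, 2) == 2 - Fraction(2, size + 1)   # (3.5), p x 2
    assert anbm_local(3, size) == 3 - Fraction(6, size + 2)   # (3.6), 3 x m
```

Because the expression depends on p and m only through their product and their sum, swapping the two arguments leaves the value unchanged, which is the symmetry noted above.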

3.4 Continuous Markov Chain Analysis 

In this section, we shall use a continuous-time Markov chain 
model to analyse the local reference model. Interestingly, the 
exact solution for this method is the same as Equation (3.7), which 
was an approximate solution for the discrete method. 

Continuous-time Markov chains were used by Bhandarkar and Fuller 
[Bha 73] to analyse the uniform reference model. To use this technique, 
we need to abandon the assumption of constant memory cycle time and assume 
that memory cycle times are exponentially distributed. Although this 
assumption is not very realistic, it may be useful to view the resulting 
expression as a lower bound on the system performance [Bha 73]. Moreover, 
this method gives us an expression that is valid for all values of $\alpha$. 

The formulae used here were derived by Jackson [Jac 63], and Gordon 
and Newell [Gor 67]; we shall, however, use the notation of Kleinrock 
[Kle 76, Section 4.12]. Our model is now viewed as a closed queuing 
network with m service centers and p permanent customers. Transitions from 
one center to another are determined by a routing probability matrix R. 

The element $r_{ij}$ of this matrix gives the probability of going to center 
j on completion of service at center i and, in our model, is equal to 
$\alpha$ when $i = j$ and $(1 - \alpha)/(m - 1)$ when $i \ne j$. States are 
denoted by a vector $k = (k_1, k_2, \ldots, k_m)$ as in the discrete model. 
The equilibrium probability $p(k_1, k_2, \ldots, k_m)$ is given by 

$$ p(k_1, k_2, \ldots, k_m) = \frac{1}{G(p)} \prod_{i=1}^{m} x_i^{k_i} $$

where 

$$ G(p) = \sum_{k \in A} \prod_{i=1}^{m} x_i^{k_i}, $$

A is that set of vectors k for which 

$$ \sum_{i=1}^{m} k_i = p, $$

$x_i = \lambda_i/\mu_i$, the $\lambda_i$ are the solutions of 
$\lambda_i = \sum_{j} \lambda_j r_{ji}$, and the $\mu_i$ are the mean service 
rates of the service centers. 

Substituting for $r_{ji}$ in the last equation, we get 

$$ \lambda_i = \alpha \lambda_i + \sum_{j \ne i} \frac{1 - \alpha}{m - 1} \lambda_j $$

or 

$$ \lambda_i = \frac{1}{m - 1} \sum_{j \ne i} \lambda_j $$

which gives 

$$ \lambda_i = \frac{1}{m} \sum_{j=1}^{m} \lambda_j . $$

Thus all $\lambda_i$'s are equal and independent of $\alpha$. From here on, 
the analysis is exactly the same as done by Bhandarkar and Fuller [Bha 73]. 
Solving for the equilibrium probabilities, we find that 

$$ p(k_1, k_2, \ldots, k_m) = \frac{1}{\binom{m + p - 1}{p}} $$

for all k. Thus, all the states of the system are equally likely. 

This gives the Average Number of Busy Memories as 

$$ \mathrm{ANBM} = \frac{mp}{m + p - 1} . $$

This equation is identical to Equation (3.7). It was first derived 
in [Bha 73] for the uniform reference model, i.e., when $\alpha = 1/m$, which 
is a particular case of our derivation. 

We thus find that under the assumption of exponentially distributed 
memory cycle times, the performance of the system is independent of 
$\alpha$. Equation (3.7) may be viewed as a lower bound on the performance of 
real-life systems in which the cycle time is not exponentially distributed. 
Similarly, the discrete Markov chain model gives an upper bound 
since the processing time of a real-life system is never a constant and 
is better approximated by the exponential distribution [Bha 75]. 
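
The step from "all states are equally likely" to the closed form for ANBM can be illustrated by brute force. The sketch below (hypothetical helper names, small systems only) enumerates every queue-length vector with total p, weights them equally, and averages the number of non-empty queues.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def states(p, m):
    """Yield every vector (k_1, ..., k_m) of non-negative integers summing to p."""
    # Stars and bars: place m - 1 bars among p + m - 1 slots; the gaps
    # between consecutive bars are the queue lengths.
    for bars in combinations(range(p + m - 1), m - 1):
        edges = (-1,) + bars + (p + m - 1,)
        yield tuple(edges[i + 1] - edges[i] - 1 for i in range(m))


def anbm_equally_likely(p, m):
    """Average number of busy modules when every state has equal probability."""
    all_states = list(states(p, m))
    assert len(all_states) == comb(m + p - 1, p)
    busy = sum(sum(1 for k in s if k > 0) for s in all_states)
    return Fraction(busy, len(all_states))
```

For small p and m the result agrees with $mp/(m + p - 1)$; the enumeration grows combinatorially, so it is meant as a check rather than as a practical method.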

3.5 Simulation Results 

Simulation studies were conducted to validate the local reference 
model and to provide the basis for comparing the expressions derived 
in the earlier sections of this chapter. The programs were written in 
FORTRAN IV and run on an IBM 7044. To find the steady-state system 
performance, the number of busy memory modules in a cycle was averaged 
over a total of 5000 memory cycles. This amounted to the processing 
of between 7000 and 33000 instructions (approximately) by the 
multiprocessor system, depending on the number of processors and memory 
modules. 

Figure 7 shows the Average Number of Busy Memory Modules plotted 
as a function of $\alpha$ for various values of p and m. This figure clearly 
demonstrates that for a given multiprocessor system, ANBM falls as $\alpha$ 
increases from 0 to 1. However, over the range $\alpha > 0.75$, the 
variation in ANBM is very small. Thus the system performance may be 
accurately represented by the average value of the ANBM figures in this 
range. 

As mentioned in Section 3.2, a multiprocessor system is generally 
expected to have $\alpha > 0.75$. The averages of the values of ANBM 
corresponding to $\alpha$ = 0.75, 0.8, 0.85, 0.9, and 0.95 were computed and 
these are shown in Table I for various $p \times m$ systems. These averages 
are compared with values obtained from Equation (3.7). In no case does 
the error exceed 4 percent. Thus, Equation (3.7) provides a very good 
approximation for the performance of real-life systems. The same is 
true of Equations (3.4), (3.5), and (3.6), since they are particular 
cases of (3.7). 

A comparison of the performances predicted by the uniform reference 
and local reference models is shown in Figure 8. The values used for 
the discrete Markov chain uniform reference model are taken from [Bha 75], 
while those for the local reference model are computed from Equation (3.7). 
As we saw in the previous section, Equation (3.7) forms a lower bound on 
the performance for all values of $\alpha$. It also forms an approximate 
upper bound for systems with $\alpha > 0.75$. Thus, for these systems, this 
equation is a very good estimate of the performance. The upper bound for the 
uniform reference model is higher than Equation (3.7); therefore the 
performance predicted by this model is generally more optimistic than 
it would be for real-life systems (with $\alpha > 0.75$). Simulation results 
for the case $\alpha = 0$ are also shown in Figure 8. It is evident from 
the figure that the performance of multiprocessor systems would be improved 
if programs have addressing patterns that would make $\alpha$ close to 0. 
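
A simulation of this kind is easy to sketch. The code below is not the FORTRAN IV program used in the study; it is a small Monte Carlo illustration under stated assumptions: constant memory cycle time, processors that issue a new request immediately after being served, and a freed processor that returns to the same module with probability $\alpha$ and otherwise picks one of the other m - 1 modules uniformly. The function name, seed handling, and initial placement are choices of this example.

```python
import random


def simulate_anbm(p, m, alpha, cycles=5000, seed=1):
    """Estimate ANBM for a p x m local reference system (m >= 2)."""
    rng = random.Random(seed)
    queue = [0] * m
    for _ in range(p):                       # arbitrary initial placement
        queue[rng.randrange(m)] += 1
    total_busy = 0
    for _ in range(cycles):
        busy = [i for i in range(m) if queue[i] > 0]
        total_busy += len(busy)
        for i in busy:                       # each busy module completes one request
            queue[i] -= 1
        for i in busy:                       # each freed processor issues its next request
            if rng.random() < alpha:
                target = i                   # same module as the previous reference
            else:
                target = rng.choice([j for j in range(m) if j != i])
            queue[target] += 1
    return total_busy / cycles
```

With $\alpha = 1/m$ the estimate should settle near the uniform reference value (for example, about 1.75 for a $2 \times 4$ system), and raising $\alpha$ toward 1 should lower it, in line with the trend described for Figure 7.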



3.6 Uniform Reference Model 

In this section, we shall derive expressions for the uniform 
reference model using the method followed in Section 3.3. Since the 
local reference model becomes equivalent to the uniform reference 
model upon substituting $\alpha = 1/m$, we get straightaway from Equations 
(3.2) and (3.3) that for uniform reference models 

$$ \mathrm{ANBM} = 2 - 1/m \tag{3.8} $$

and 

$$ \mathrm{ANBM} = 2 - 1/p \tag{3.9} $$

for the $2 \times m$ and $p \times 2$ systems respectively. 

A similar discrete Markov chain analysis may be done for the 
$3 \times m$ system ($m \ge 3$) to give 

$$ \mathrm{ANBM} = 3 - \frac{3}{m} + \frac{1}{m^3 - m^2 + m} \tag{3.10} $$

in the case of uniform reference models. Note that all three expressions 
(3.8), (3.9), and (3.10) are exact. In (3.10), the last term becomes 
small when m increases, so we may write as an approximation: 

$$ \mathrm{ANBM} = 3 - 3/m \tag{3.11} $$

Following a similar process, for a $4 \times m$ system ($m \ge 4$) we find 
that (see footnote 2) 

$$ \mathrm{ANBM} = 4 - \frac{6}{m} + \frac{7m^3 - 12m^2 + 14m - 12}{m(m^5 - 3m^4 + 8m^3 - 11m^2 + 8m - 4)} \tag{3.12} $$

When m is large, Equation (3.12) may be approximated by 

$$ \mathrm{ANBM} = 4 - 6/m \tag{3.13} $$
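
The exact $3 \times m$ result can be reproduced from the three reduced states (3), (2,1), and (1,1,1). The sketch below is an illustration with a hypothetical function name; it relies on the observation, which follows from uniform reassignment of freed processors, that the next-state distribution is the same from (2,1) and from (1,1,1).

```python
from fractions import Fraction


def anbm_3xm_uniform(m):
    """Exact ANBM of a 3 x m uniform reference system (m >= 3)."""
    stay_3 = Fraction(1, m)                       # (3) -> (3): freed processor returns
    to_3 = Fraction(1, m * m)                     # (2,1) or (1,1,1) -> (3)
    to_111 = Fraction((m - 1) * (m - 2), m * m)   # (2,1) or (1,1,1) -> (1,1,1)
    # Balance for state (3): pi_3 = pi_3 * stay_3 + (1 - pi_3) * to_3.
    pi_3 = to_3 / (1 - stay_3 + to_3)
    # State (1,1,1) cannot be reached from state (3).
    pi_111 = (1 - pi_3) * to_111
    pi_21 = 1 - pi_3 - pi_111
    # Busy modules: 1 in (3), 2 in (2,1), 3 in (1,1,1).
    return pi_3 + 2 * pi_21 + 3 * pi_111
```

For m = 3 this gives 43/21, or about 2.048, and for every m tested it coincides with the closed form in Equation (3.10).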


Footnote: 

2 We owe the correct form of this equation to Dr. D.P. Bhandarkar. 

A general expression, suggested by Equations (3.8), (3.11), and 
(3.13), is that in a $p \times m$ system ($m > p$) the Average Number of Busy 
Memory Modules (for a uniform reference model) should approximately 
be given by 

$$ \mathrm{ANBM} = p - \frac{p(p - 1)}{2m} \tag{3.14} $$

However, since we know from [Bha 75] that the performances of an 
$A \times B$ system and a $B \times A$ system are almost equal, we may write 

$$ \mathrm{ANBM} = j - \frac{j(j - 1)}{2i} \tag{3.15} $$

where $i = \max(m, p)$ and $j = \min(m, p)$. 

Let us now compare expression (3.15) with two others available in 
the literature for the uniform reference model. First, Strecker's 
expression [Str 70], as modified by Bhandarkar [Bha 75], is 

$$ \mathrm{ANBM} = i \left[ 1 - \left( 1 - \frac{1}{i} \right)^{j} \right] \tag{3.16} $$

where $i = \max(m, p)$ and $j = \min(m, p)$. Second, an asymptotic expression 
given by Baskett and Smith [Bas 76] is: 

$$ \mathrm{ANBM} = m + p - (m^2 + p^2)^{1/2} \tag{3.17} $$

In order to compare expressions (3.15), (3.16), and (3.17), we have 
used the exact numerical results given by Bhandarkar [Bha 75] and, beyond 
Bhandarkar's, results obtained in the simulation study described in 
Section 3.5 (with $1/m$ substituted for $\alpha$). Table II(a) gives the 
values of ANBM for $p \times m$ systems with $2 \le p \le 10$ and 
$2 \le m \le 12$. For $p \le 8$ and $m \le 8$ we have used the exact values 
of Bhandarkar. The rest of the entries were obtained by simulation. The 
values obtained from Equations (3.16), (3.17), and (3.15) and their 
comparison with the exact results are shown in Table II. 

It can be seen that Baskett and Smith's expression (3.17) is highly 
inaccurate for small values of m and p. Its accuracy improves as m and 
p increase. Both Bhandarkar's expression (3.16) and our expression 
(3.15) improve in accuracy as j increases for a constant i. Equation 
(3.15) is by far the best of the three for all values of m and p, except 
when m and p are large and nearly equal. In this range and only in 
this range is Equation (3.17) better. 
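
The three approximations are simple enough to tabulate side by side. The following sketch encodes them as read from Equations (3.15), (3.16), and (3.17); the function names are hypothetical, and the form used for the Baskett and Smith expression is the one given above.

```python
from math import sqrt


def anbm_ours(p, m):
    """Equation (3.15)."""
    i, j = max(m, p), min(m, p)
    return j - j * (j - 1) / (2 * i)


def anbm_strecker(p, m):
    """Equation (3.16): Strecker's expression as modified by Bhandarkar."""
    i, j = max(m, p), min(m, p)
    return i * (1 - (1 - 1 / i) ** j)


def anbm_baskett_smith(p, m):
    """Equation (3.17): asymptotic expression."""
    return m + p - sqrt(m * m + p * p)


if __name__ == "__main__":
    for p, m in [(2, 2), (2, 8), (4, 4), (8, 8)]:
        print(p, m,
              round(anbm_ours(p, m), 4),
              round(anbm_strecker(p, m), 4),
              round(anbm_baskett_smith(p, m), 4))
```

For the $2 \times 2$ system, whose exact value from (3.8) is 1.5, the first two expressions return 1.5 while the asymptotic one returns about 1.17, which illustrates its inaccuracy for small systems.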

3.7 Conclusions 

The uniform reference model has been extensively studied in the 
literature because of its simplicity. However, it does not provide 
a good approximation to the performance of real-life systems in which 
programs have strong locality of reference. The local reference model 
proposed in this chapter explicitly models this property, which 
characterizes a majority of real-life computer programs. Our results show 
that the performance of such programs in a multiprocessor system is 
significantly worse than what is predicted by the uniform reference 
model used earlier. It would thus be worthwhile to make serious efforts 
in designing programs with uniformly random addressing patterns. Efforts 
have been made (such as [Fer 76]) in the context of multiprogramming 
environments to make programs more local in behavior. The opposite 
of this is needed for multiprocessors. Research in the design of programs 
with uniform addressing patterns could give valuable results and would 
help in substantially improving the performance of a multiprocessor system. 



CHAPTER 4 


MODEL FOR TIME-SHARED BUS SYSTEMS 

In this chapter, we present a discrete Markov chain model for a 
multiprocessor with a time-shared bus as shown in Figure 3. After describing 
our notations in Section 4.1, we discuss the major assumptions of the model 
in Section 4.2. The analysis of the model is outlined in Section 4.3. An 
example is given in Section 4.4 to explain the analysis technique in detail. 
Section 4.5 presents simulation results for the example of Section 4.4 to 
show the accuracy of the analytical results. 

4.1 Notations 

We shall consider a multiprocessor system with p processors and m memory 
modules. The memory modules are of two types as explained in the 
assumptions to follow in the next section; there are $m_1$ memories of type 1 
and $m_2$ ($= m - m_1$) memories of type 2. The processing time of a 
processor will be denoted by $t_p$, the cycle times of the two types of 
memories by $t_{c1}$ and $t_{c2}$, and the bus cycle time by $t_b$. The 
symbols $\alpha$ and $\beta$ ($= 1 - \alpha$) stand for the parameters of the 
processing time distribution while $\gamma_1$, $\delta_1$ 
($= 1 - \gamma_1$), $\gamma_2$, and $\delta_2$ ($= 1 - \gamma_2$) are 
parameters of the memory cycle time distributions. The average processing 
time and the average memory cycle times are $\bar{t}_p$, $\bar{t}_{c1}$, 
and $\bar{t}_{c2}$ respectively. The unit instruction execution rates of 
instructions executed from the two types of memories are denoted by UER1 and 
UER2, the utilization factors of these memories by $\rho_1$ and $\rho_2$, and 
the total unit instruction execution rate of the system by UER. The states of 
the system will be represented by vectors whose components are denoted by 
$k_0, k_1, k_2, k_{11}, k_{12}, \ldots, k_{21}, k_{22}, \ldots$, etc. The 
element $T(i,j)$ of the matrix T stands for the transition probability from 
state $s_i$ to state $s_j$. The equilibrium probability of state $s_i$ is 
denoted by $z(i)$. 

4.2 Assumptions of the Model 

The following major assumptions characterize the model developed in 
this chapter: 

Assumption 1: The p processors and m memory modules of the system are 
connected by a time-shared bus with constant cycle time $t_b$. In one cycle 
time, the bus transfers information from a processor to a memory or vice 
versa. 

Assumption 2: The processing times of all processors are identically 
geometrically distributed with the distribution 

$$ \Pr[t_p = i\,t_b] = \beta \alpha^{i}, \quad i = 0, 1, 2, \ldots, $$

where $\beta = 1 - \alpha$. 

The mean processing time is given by $\bar{t}_p = \alpha t_b/\beta$. Thus 
any mean value of $t_p$ can be considered by appropriately choosing 
$\alpha$. That this is a realistic assumption has been shown by the models 
of Bhandarkar [Bha 75], who also argues that the processing time of a 
real-life processor has a distribution quite close to exponential. 

Assumption 3: The memory modules are of two types: there are $m_1$ memories 
of type 1 and $m_2$ ($= m - m_1$) memories of type 2. The access time of all 
memories is equal to their cycle time and the re-write time is zero (see 
Figure 6(b)). The cycle times of all memories of type j (j = 1, 2) are 
identically geometrically distributed with the distribution 

$$ \Pr[t_{cj} = i\,t_b] = \delta_j \gamma_j^{\,i-1}, \quad i = 1, 2, \ldots, $$

where $\delta_j = 1 - \gamma_j$. 

The mean cycle time of a memory of type j is given by 
$\bar{t}_{cj} = t_b/\delta_j$. The difference between these distributions 
and the processing time distribution should be noted. The processing time 
takes on values $0, t_b, 2t_b, \ldots$, whereas memory cycle times are not 
permitted to take value 0. As before, any mean value of $t_{cj}$ can be 
chosen by appropriately choosing $\gamma_j$. 

Assumption 3 makes our model more general than one which has all memory 
modules identical. The reason for choosing two different types of memories 
is that such a model will be needed for analysing the Pluribus system in 
the next chapter, where local memories and global memories may have different 
cycle times. Note that we can always choose $t_{c1} = t_{c2}$ (or $m_2 = 0$) 
as a particular case. 

The memory cycle time has been assumed to have a geometric distribution 
in order to keep the analysis of the model tractable. We do not want to 
claim that this is a realistic assumption. Memory modules generally do have 
a constant cycle time. However, the model is robust enough for this 
assumption to make no significant difference. This will be borne out by the 
simulation results presented in Section 4.5 to validate this model. 
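
The remark that any mean value can be obtained by choosing the distribution parameter appropriately amounts to inverting the two mean formulas. The sketch below does this and checks the result numerically; the function names and the truncation length are assumptions of this example, and all times are in nanoseconds.

```python
def alpha_for_mean_processing_time(mean_tp, tb):
    """Invert mean_tp = alpha * tb / (1 - alpha)."""
    return mean_tp / (mean_tp + tb)


def gamma_for_mean_cycle_time(mean_tc, tb):
    """Invert mean_tc = tb / (1 - gamma); requires mean_tc >= tb."""
    return 1 - tb / mean_tc


def truncated_mean(pmf, start, terms=10_000):
    """Mean of an integer-valued distribution, summed over a finite range."""
    return sum(i * pmf(i) for i in range(start, start + terms))


tb = 150.0
alpha = alpha_for_mean_processing_time(120.0, tb)
gamma = gamma_for_mean_cycle_time(600.0, tb)

# Processing time: Pr[t_p = i * tb] = (1 - alpha) * alpha**i, i = 0, 1, 2, ...
mean_tp = tb * truncated_mean(lambda i: (1 - alpha) * alpha ** i, 0)
# Cycle time: Pr[t_c = i * tb] = (1 - gamma) * gamma**(i - 1), i = 1, 2, ...
mean_tc = tb * truncated_mean(lambda i: (1 - gamma) * gamma ** (i - 1), 1)
```

Because the memory cycle time cannot be shorter than one bus cycle, a mean cycle time below $t_b$ has no valid $\gamma_j$, whereas any non-negative mean processing time is attainable.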

Assumption 4: All processor requests to memory are read operations. The 
processor transmits the read address over the bus to the memory module. 
A memory module can fetch only one word in one cycle. Once the data has 
been fetched, it is transmitted back over the bus to the requesting 
processor. However, the memory module is not held up while the data is being 
transmitted and is free to take up a new request as soon as the previous 
cycle is over. 

Assumption 5: The requests of any one processor are uniformly distributed 
over all memory modules, regardless of their type. 

Assumption 6: Each memory module has a queue in which processor requests 
are queued. After a request has been served, the next request is taken 
from the queue if it is not empty. 

Assumption 7: The time-shared bus has two queues, one each for the traffic 
in the two directions. The traffic from memories to processors (in queue 1) 
has priority over that from processors to memories (in queue 2). 

The structure of this queueing model is shown in Figure 9. 

4.3 Analysis of the Model 

We shall now model the process defined by the assumptions of the 
previous section as a discrete Markov chain (see footnote 1). At any given 
time, the state of the system can be characterized by the lengths of the 
queues at the time-shared bus and the memory modules. This state is denoted 
by the vector 

$$ (k_0, k_1, k_2;\ k_{11}, k_{12}, \ldots, k_{1,m_1};\ k_{21}, k_{22}, \ldots, k_{2,m_2}) $$

where $k_0$ = number of processors in the processing state; $k_1$ = length of 
queue 1 at the bus; $k_2$ = length of queue 2 at the bus; $k_{ji}$ = number 
of processors waiting in the queue of the i-th memory module of type j 
(including the processor being served), $i = 1, 2, \ldots, m_j$, and 
$j = 1, 2$. Note that $k_1$ or $k_2$ also includes the item being served by 
the bus, depending on the queue to which it belongs. The component $k_0$ of 
the vector is actually redundant because 

$$ k_0 + k_1 + k_2 + \sum_{j=1}^{2} \sum_{i=1}^{m_j} k_{ji} = p; $$

it is included for the sake of convenience. 

Footnote: 

1 For a description of Markov chain processes, see Kleinrock [Kle 75] 
and Bhandarkar [Bha 75]. 

FIGURE 9: Structure of the queuing model for time-shared bus systems. 

Since all processors are identical, and all memory modules of a 
particular type are also identical, a number of these states are 
equivalent, e.g., states (1,0,0; 1,2,0; 1,0), (1,0,0; 2,0,1; 1,0), 
(1,0,0; 1,2,0; 0,1), (1,0,0; 2,0,1; 0,1), etc. Every such equivalence 
class will be called a reduced state. 

State changes occur only at the end of bus cycles. Thus if we con- 
sider transitions between states at points just prior to the end of bus 
cycles, we get a discrete-parameter Markov chain. The procedure for 
analysing this model is as follows: 

Given the values of $p$, $m_1$, $m_2$, $\alpha$, $\gamma_1$ and $\gamma_2$, 
all the reduced states must be enumerated and the transition probabilities 
$T(i,j)$ between every pair of states $(s_i, s_j)$ must be found. The method 
of enumerating the reduced states and finding the transition probabilities is 
similar to that employed in the crossbar switch models of the previous 
chapter; a systematic technique for it has also been given by Bhandarkar 
[Bha 75]. The equilibrium state probabilities $z(i)$ can now be evaluated by 
solving the set of equations 

$$ z(i) = \sum_{j} z(j)\, T(j,i) \tag{4.1} $$

together with 

$$ \sum_{i} z(i) = 1 . \tag{4.2} $$

The unit instruction execution rate (UER) or any other performance measure 
one is interested in can be calculated from a knowledge of the equilibrium 
state probabilities. 
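
Solving Equations (4.1) and (4.2) is a small linear-algebra problem. The sketch below is one possible way to do it, not the program used in the thesis: it replaces one of the (linearly dependent) balance equations by the normalization condition and applies Gauss-Jordan elimination in exact arithmetic. The function name is hypothetical and an irreducible chain is assumed.

```python
from fractions import Fraction


def equilibrium(T):
    """Solve z = z T together with sum(z) = 1 for an irreducible chain.

    T[j][i] is the transition probability from state j to state i.
    """
    n = len(T)
    # Rows 0..n-2: sum_j z(j) * (T[j][i] - [i == j]) = 0.
    A = [[Fraction(T[j][i]) - (1 if i == j else 0) for j in range(n)]
         for i in range(n - 1)]
    # Last row: sum_j z(j) = 1.
    A.append([Fraction(1)] * n)
    b = [Fraction(0)] * (n - 1) + [Fraction(1)]
    # Gauss-Jordan elimination with row pivoting.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]
```

As a usage example, the two-state chain with rows (3/4, 1/4) and (5/36, 31/36) has equilibrium probabilities 5/14 and 9/14. For large state spaces a floating-point solver would be the more practical choice.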

This outline of the analysis method will be explained in detail by 
the example of the next section. 

4.4 An Example 

Consider a system with 2 processors and 3 memory modules, 2 of which 
belong to type 1 and one is of type 2. Thus $p = 2$, $m_1 = 2$, and 
$m_2 = 1$. For this case, there are 16 reduced states: 

    s_1  = (2,0,0; 0,0; 0),    s_2  = (1,1,0; 0,0; 0), 
    s_3  = (1,0,1; 0,0; 0),    s_4  = (1,0,0; 1,0; 0), 
    s_5  = (1,0,0; 0,0; 1),    s_6  = (0,2,0; 0,0; 0), 
    s_7  = (0,1,1; 0,0; 0),    s_8  = (0,1,0; 1,0; 0), 
    s_9  = (0,1,0; 0,0; 1),    s_10 = (0,0,2; 0,0; 0), 
    s_11 = (0,0,1; 1,0; 0),    s_12 = (0,0,1; 0,0; 1), 
    s_13 = (0,0,0; 2,0; 0),    s_14 = (0,0,0; 1,1; 0), 
    s_15 = (0,0,0; 1,0; 1),    s_16 = (0,0,0; 0,0; 2). 
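
The count of 16 reduced states can be reproduced by brute-force enumeration. The sketch below is illustrative only (the function name and the tuple layout are assumptions of this example): it generates every state vector whose components sum to p and merges vectors that differ only by a permutation of memory modules of the same type.

```python
from itertools import product


def reduced_states(p, m1, m2):
    """Enumerate reduced states as (bus part, type-1 queues, type-2 queues)."""
    seen = set()
    for k in product(range(p + 1), repeat=3 + m1 + m2):
        if sum(k) != p:
            continue
        bus = k[:3]                                   # (k_0, k_1, k_2)
        # Modules of one type are interchangeable, so sort their queue lengths.
        type1 = tuple(sorted(k[3:3 + m1], reverse=True))
        type2 = tuple(sorted(k[3 + m1:], reverse=True))
        seen.add((bus, type1, type2))
    return sorted(seen, reverse=True)


states = reduced_states(2, 2, 1)
```

For p = 2, $m_1 = 2$, $m_2 = 1$ the list has 16 entries, including ((0,0,0), (1,1), (0,)), which corresponds to $s_{14}$. The enumeration is exponential in the number of components, so it is suitable only for small systems.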

The transition matrix for these states is given in the Appendix in 
terms of parameters $\alpha$, $\gamma_1$, $\gamma_2$, and 
$\beta = 1 - \alpha$, $\delta_1 = 1 - \gamma_1$, $\delta_2 = 1 - \gamma_2$. 
A computer program can now be written which, given values for these 
parameters, will solve the set of simultaneous linear equations (4.1), (4.2), 
and compute the equilibrium state probabilities $z(i)$. The unit instruction 
execution rate (UER) can now be found in the following way. 





During any given period of time, every instruction executed by the 
system keeps the bus busy for two cycles: once for transferring the address 
to the memory, and again for transmitting the data back to the processor. 
Hence the sum of the equilibrium probabilities of those states during which 
the bus is serving an item from queue 1 will give the average number of 
instructions executed in one bus cycle of duration $t_b$. These states are 
$s_2$, $s_6$, $s_7$, $s_8$, and $s_9$. Alternatively, we may find the sum of 
the equilibrium probabilities of those states during which the bus is serving 
an item from queue 2. Since queue 1 has priority over queue 2, these states 
are $s_3$, $s_{10}$, $s_{11}$, and $s_{12}$. Thus we get, 

$$ \mathrm{UER} = \bigl(z(2) + z(6) + z(7) + z(8) + z(9)\bigr)/t_b = \bigl(z(3) + z(10) + z(11) + z(12)\bigr)/t_b . $$

We are also interested in finding UER1 and UER2, the average rates of 
instructions executed from memories of type 1 and from memories of type 2 
respectively. Clearly, UER = UER1 + UER2. 

The sum of the equilibrium probabilities of all states weighted by the 
number of memories of type j busy during that state gives the utilization 
factor $\rho_j$ of those memories. Thus 

$$ \rho_1 = z(4) + z(8) + z(11) + z(13) + 2\,z(14) + z(15) $$

and 

$$ \rho_2 = z(5) + z(9) + z(12) + z(15) + z(16) . $$

Since the average cycle time for memories of type j is 
$\bar{t}_{cj} = t_b/\delta_j$, and one instruction is fetched by one memory 
module in one cycle, the average rate of instructions executed from memories 
of type j is 

$$ \mathrm{UER}j = \rho_j/\bar{t}_{cj} = \rho_j \delta_j/t_b . $$

TABLE III 

RESULTS FOR TIME-SHARED BUS MODEL 

$p = 2$, $m_1 = 2$, $m_2 = 1$, $\bar{t}_{c1} = 600$ nsec. 

Columns: mean processing time $\bar{t}_p$ (nsec); bus cycle time $t_b$ 
(nsec); mean type-2 cycle time $\bar{t}_{c2}$ (nsec); analytic UER1, UER2, 
and UER (insts/µsec); simulated UER (insts/µsec); percentage error. 

    t_p    t_b    t_c2  |  UER1     UER2     UER     |  UER (sim)  |  % error 
    --------------------+----------------------------+-------------+--------- 
    120    150    300   |  1.2514   0.6257   1.8772  |  1.928      |  2.71 
                  600   |  1.1309   0.5655   1.6964  |  1.785      |  5.22 
                  900   |  1.0193   0.5097   1.5290  |  1.593      |  4.19 
           300    300   |  0.9122   0.4561   1.3683  |  1.394      |  1.88 
                  600   |  0.8514   0.4257   1.2772  |  1.337      |  4.68 
                  900   |  0.7926   0.3963   1.1889  |  1.232      |  3.63 
    240    150    300   |  1.1455   0.5728   1.7183  |  1.758      |  2.31 
                  600   |  1.0433   0.5217   1.5650  |  1.624      |  3.77 
                  900   |  0.9484   0.4742   1.4226  |  1.473      |  3.54 
           300    300   |  0.8630   0.4315   1.2945  |  1.304      |  0.73 
                  600   |  0.8070   0.4035   1.2105  |  1.248      |  3.10 
                  900   |  0.7535   0.3767   1.1302  |  1.165      |  3.08 
    360    150    300   |  1.0508   0.5254   1.5763  |  1.621      |  2.84 
                  600   |  0.9643   0.4822   1.4465  |  1.482      |  2.45 
                  900   |  0.8834   0.4417   1.3251  |  1.382      |  4.29 
           300    300   |  0.8130   0.4065   1.2195  |  1.235      |  1.27 
                  600   |  0.7626   0.3813   1.1439  |  1.176      |  2.81 
                  900   |  0.7146   0.3573   1.0719  |  1.095      |  2.16 
    [illegible] 
           150    300   |  0.9681   0.4841   1.4522  |  1.479      |  1.85 
                  600   |  0.8944   0.4472   1.3416  |  1.386      |  3.31 
                  900   |  0.8251   0.4126   1.2377  |  1.282      |  3.53 
           300    300   |  0.7657   0.3829   1.1486  |  1.150      |  0.12 
                  600   |  0.7207   0.3603   1.0810  |  1.097      |  1.40 
                  900   |  0.6777   0.3388   1.0165  |  1.032      |  1.52 
    600    150    300   |  0.8960   0.4480   1.3441  |  1.385      |  3.04 
                  600   |  0.8328   0.4164   1.2492  |  1.279      |  2.39 
                  900   |  0.7730   0.3865   1.1595  |  1.175      |  1.34 
           300    300   |  0.7221   0.3610   1.0831  |  1.083      |  0.01 
                  600   |  0.6818   0.3409   1.0227  |  1.045      |  2.18 
                  900   |  0.6433   0.3217   0.9650  |  0.976      |  1.14 
    1200   150    300   |  0.6472   0.3236   0.9709  |  0.955      |  1.64 
                  600   |  0.6143   0.3071   0.9214  |  0.931      |  1.04 
                  900   |  0.5824   0.2912   0.8736  |  0.885      |  1.30 
           300    300   |  0.5552   0.2776   0.8328  |  0.828      |  0.58 
                  600   |  0.5312   0.2656   0.7968  |  0.818      |  2.66 
                  900   |  0.5080   0.2540   0.7620  |  0.766      |  0.52 
    1800   150    300   |  0.5040   0.2520   0.7560  |  0.751      |  0.66 
                  600   |  0.4841   0.2420   0.7261  |  0.727      |  0.12 
                  900   |  0.4646   0.2323   0.6970  |  0.682      |  2.15 
           300    300   |  0.4477   0.2239   0.6716  |  0.685      |  2.00 
                  600   |  0.4321   0.2161   0.6482  |  0.640      |  1.27 
                  900   |  0.4169   0.2085   0.6254  |  [illegible] |  [illegible] 

TABLE III (Continued) 

    t_p    t_b    t_c2  |  UER1     UER2     UER     |  UER (sim)  |  % error 
    --------------------+----------------------------+-------------+--------- 
    2400   150    300   |  0.4119   0.2060   0.6179  |  0.608      |  1.60 
                  600   |  0.3987   0.1994   0.5981  |  0.595      |  0.52 
                  900   |  0.3857   0.1928   0.5785  |  0.575      |  0.61 
           300    300   |  0.3742   0.1871   0.5612  |  0.542      |  3.42 
                  600   |  0.3633   0.1816   0.5449  |  0.548      |  0.57 
                  900   |  0.3526   0.1763   0.5290  |  0.536      |  1.32 
    3000   150    300   |  0.3481   0.1740   0.5221  |  0.527      |  0.94 
                  600   |  0.3386   0.1693   0.5080  |  0.505      |  0.59 
                  900   |  0.3294   0.1647   0.4940  |  0.490      |  0.81 
           300    300   |  0.3210   0.1605   0.4815  |  0.483      |  0.31 
                  600   |  0.3130   0.1565   0.4695  |  0.475      |  1.17 
                  900   |  0.3052   0.1526   0.4578  |  0.447      |  2.36 

Table III shows the values of UER, UER1 and UER2 computed from this 

model for $\bar{t}_{c1} = 600$ nsec and various values of $\bar{t}_p$, 
$t_b$, and $\bar{t}_{c2}$. A close observation of this table shows that 

$$ \mathrm{UER}j = \mathrm{UER} \cdot m_j/(m_1 + m_2) . $$

This actually follows from Assumption 5, namely that the processors' requests 
are uniformly distributed over all memory modules. The difference in speed 
of the two types of memories will affect the total instruction execution 
rate, but each memory module will continue to receive its share of processor 
requests. 
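
This proportionality can be checked directly against the analytic columns of Table III. The sketch below uses the first three rows of the table; the helper name `split_uer` is an assumption of this example, and the tolerance reflects the four-decimal rounding of the tabulated values.

```python
def split_uer(uer, m1, m2):
    """Split the total UER between the two memory types in proportion to m_j."""
    total = m1 + m2
    return uer * m1 / total, uer * m2 / total


# (UER1, UER2, UER) analytic values from the first three rows of Table III.
rows = [
    (1.2514, 0.6257, 1.8772),
    (1.1309, 0.5655, 1.6964),
    (1.0193, 0.5097, 1.5290),
]

m1, m2 = 2, 1
max_deviation = 0.0
for uer1, uer2, uer in rows:
    predicted1, predicted2 = split_uer(uer, m1, m2)
    max_deviation = max(max_deviation,
                        abs(predicted1 - uer1),
                        abs(predicted2 - uer2))
```

The largest deviation over these rows is well below 0.001 insts/µsec, which is consistent with rounding alone.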

4.5 Simulation Results 

Simulation studies were conducted to validate this model for time-shared 
bus systems. The simulation program assumed constant memory cycle 
times so that we could estimate the effect of Assumption 3 (geometrically 
distributed memory cycle times). The program was written in FORTRAN IV and 
run on an IBM 7044. To find the steady-state system performance, the 
instruction execution rate was averaged over a total of 5000 memory cycles. 

The results are shown in Table III along with the analytical results. It 
can be seen that in no case does the error exceed 6 percent. Thus we can 
say that the assumption of geometrically distributed memory cycle times 
is acceptable. The closeness of the analytical and simulation results 
demonstrates the usefulness of the model. 



CHAPTER 5 


MODEL FOR PLURIBUS SYSTEMS 

In this chapter, we develop a model to analyse the performance of 
the Pluribus multiprocessor system which was described in Chapter 2. 
The notations used in this chapter are described in Section 5.1. The 
major assumptions of the model are outlined in Section 5.2. Section 5.3 
gives the detailed analysis of the model. The analytic results are 
presented in Section 5.4 and are compared with the simulation results in 
Section 5.5. 

5.1 Notations 

For the Pluribus model in this chapter, we shall consider the system 
configuration shown in Figure 10. We assume the following parameters for 
this model: 

$n_p$ = number of processors on each processor bus, 

$n_{ml}$ = number of local memory modules on each processor bus, 

$n_{mg}$ = number of global memory modules, 

$n_b$ = number of processor buses, 

$t_p$ = processing time of each processor, 

$t_{cl}$ = cycle time of each local memory module, 

$t_{cg}$ = cycle time of each global memory module, 

and $t_b$ = bus cycle time. 

It should be noted that the system modelled here (Figure 10) differs
in one respect from the actual configuration of the Pluribus system shown
in Figure 5. The crossbar switch, instead of connecting to memory buses,
here connects directly to global memory modules.

FIGURE 10: Pluribus configuration for the model

These two systems may be considered equivalent through the following
transformation: let the system in Figure 5 have n_{mb} memory buses with bus
cycle time t_{mb}, each having n_{bg} memory modules, and let the cycle time
of these memory modules be t_c. This system is then equivalent to the one
in Figure 10 where the number of global memory modules is
n_{mg} = n_{mb} . n_{bg} with memory cycle time t_{cg} = t_c + 2 t_{mb}. The
term 2 t_{mb} in the cycle time reflects our assumption that two bus cycles
are necessary to access one word from the memories: one for transferring the
address to the module and the other for transmitting the data back to the
processor.

This transformation is approximate, however. It neglects the queuing
delays due to conflicts for the memory buses. If the number of memories
per bus is not large and the buses are fast enough, this approximation is
justified.
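
The transformation can be sketched in a few lines of Python. This is only an
illustration: the function name and the argument names (n_mb, n_bg, t_c, t_mb)
are labels chosen here for the Figure 5 quantities, and the sample values in
the call are assumed for the example.

```python
def figure5_to_model(n_mb, n_bg, t_c, t_mb):
    """Map a Figure 5 style configuration onto the Figure 10 model.

    n_mb: number of memory buses
    n_bg: memory modules on each memory bus
    t_c:  cycle time of each memory module
    t_mb: cycle time of each memory bus
    Queuing delays on the memory buses are neglected, as in the text.
    """
    n_mg = n_mb * n_bg      # every module is treated as a global memory module
    t_cg = t_c + 2 * t_mb   # one bus cycle for the address, one for the data
    return n_mg, t_cg


# Assumed sample values: 2 memory buses, 2 modules per bus, times in nsec.
print(figure5_to_model(2, 2, 850, 200))
```

The sketch makes the approximation explicit: the only effect of the memory
buses that survives is the fixed 2 t_mb added to the module cycle time.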

5.2 Assumptions of the Model

The following major assumptions will be made for this model:

Assumption 1: The crossbar switch has zero delay. However, crossbar switches
with nonzero delay may be modelled by simply adding the delay to t_{cg}, the
global memory cycle time.

Assumption 2: For all memory modules, local as well as global, the access
time is equal to the cycle time and the rewrite time is zero. (See Figure
6(b)).

Assumption 3: Any particular processor can access the local memory modules
on its own bus as well as all the global memory modules. These accesses
have the following distribution: Each local memory module is accessed
with probability 1/(n_{ml} + 1). Each global memory module is accessed with
probability 1/(n_{mg} (n_{ml} + 1)). There are two reasons for assuming this
distribution. First, in any computer system where memory is divided into
local and global parts, accesses from local memory are generally more
frequent than those from global memory. The distribution we have assumed
conforms to this pattern: the total probability assigned to accesses from
all the global memory modules is equal to the probability assigned to
accesses from one local memory module. The second reason is that, as will
be clear in the next section, this distribution will permit us to use
unchanged the bus model developed in Chapter 4.
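
As a quick check of the access distribution in Assumption 3, the following
Python sketch computes the two probabilities with exact fractions. The
function name and the sample values (2 local and 4 global modules) are
assumptions made for the illustration; n_ml and n_mg stand for the numbers of
local and global memory modules.

```python
from fractions import Fraction


def access_probabilities(n_ml, n_mg):
    """Return (probability per local module, probability per global module)."""
    p_local = Fraction(1, n_ml + 1)
    p_global = Fraction(1, n_mg * (n_ml + 1))
    return p_local, p_global


p_local, p_global = access_probabilities(n_ml=2, n_mg=4)
total = 2 * p_local + 4 * p_global   # all local plus all global modules
print(p_local, p_global, total)
```

All the global modules together receive the same share as one local module,
so each processor bus sees n_ml + 1 equally loaded request targets.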


5.3 Analysis of the Model

We now give a method for analysing the performance of this model of
the Pluribus system. We have seen in Chapter 3 that there are a number of
models for analysing the crossbar switch system. We have also developed,
in Chapter 4, an analytic model for the time-shared bus system. Since the
Pluribus is a composite of a crossbar switch and time-shared buses, we
shall try to decompose it into these components and use the models for
these to analyse the performance of the whole. However, the bus and the
crossbar subsystems are not isolated; there is an interaction between them
and our model must take this into account. This is done in the following
manner.

Consider a processor bus; let us call it B1. This bus interacts
with the rest of the system (namely, the remaining processor buses, the
crossbar switch, and the global memory modules) by sending requests for
accessing the global memories. These requests are serviced after a certain
time, which depends on the global memory cycle time and the interference
due to requests from other processor buses. Thus, as far as bus B1 is
concerned, the rest of the system simply looks and acts like a memory
module with a certain cycle time, say t_{cv}. This is true for each of the
processor buses. Hence the Pluribus system may be represented as shown
in Figure 11. The system now consists of n_{pb} independent processor buses,
each with n_p processors with processing time t_p, n_{ml} local memories with
cycle time t_{cl}, and a 'virtual' memory module with cycle time t_{cv}, which
replaces the rest of the system.

The problem is to determine the average value of this cycle time t_{cv}
of each of these 'virtual' memories. To do this, let us look at the system
from the viewpoint of the crossbar switch. As far as the crossbar switch
is concerned, each processor bus simply acts as a processor. It sends a
request for accessing data from global memory, and after the request is
satisfied, takes a certain average processing time, say t_{pv}, before sending
[...] global memories.

Footnote:

The term 'virtual' memory used here denotes the fact that this is not
a real memory module and should not be confused with its conventional meaning
in computer systems architecture.

FIGURE 11: Pluribus system with 'virtual' memory modules (MV)
replacing the crossbar switch component.

FIGURE 12: Crossbar component of the Pluribus system
with 'virtual' processors (PV) replacing
each processor bus.

FIGURE 13: Interaction between the bus and crossbar components

In Figure 13, let d(t_{pv}, t_{cv}) be the rate at which the
'virtual' processor executes instructions from the 'virtual' memory.
Clearly,

    b(n_p, n_{ml}, t_p, t_{cl}, t_b, t_{cv}) = d(t_{pv}, t_{cv})

    c(n_{pb}, n_{mg}, t_{cg}, t_{pv}) = d(t_{pv}, t_{cv})                (5.1)

If the functions b, c, and d are known, then we have a set of two
equations with two unknowns, t_{pv} and t_{cv}. These can be solved for
t_{pv} and t_{cv}, and the performance of the total system is then found
as follows.

As we noticed at the end of Section 4.4 of Chapter 4, for each
processor bus, the rate of instruction execution from all memory modules
(including the 'virtual' memory) is the same; it is thus given by
d(t_{pv}, t_{cv}). Hence the total unit instruction execution rate for each
bus is

    (n_{ml} + 1) . d(t_{pv}, t_{cv}).

Since there are n_{pb} buses in all, the total UER for the Pluribus system is

    UER = n_{pb} . (n_{ml} + 1) . d(t_{pv}, t_{cv})                      (5.2)
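
A minimal Python sketch of Equation (5.2) is given below, taking the rate d
as the reciprocal of the sum of the 'virtual' processing time t_pv and the
'virtual' cycle time t_cv. The function name is an assumption made for the
illustration, and the numbers in the call are the converged t_pv and t_cv
values of the first sample case in Table IV (times in nsec).

```python
def pluribus_uer(n_pb, n_ml, t_pv, t_cv):
    """Total unit instruction execution rate, per unit of the time scale used.

    n_pb: number of processor buses
    n_ml: local memory modules on each processor bus
    t_pv: 'virtual' processing time, t_cv: 'virtual' memory cycle time
    """
    d = 1.0 / (t_pv + t_cv)          # rate per memory module
    return n_pb * (n_ml + 1) * d     # n_ml local modules plus one 'virtual' one


# Times in nsec, so multiply by 1000 to obtain instructions per microsecond.
print(round(pluribus_uer(7, 2, 3033.0, 1615.0) * 1000.0, 3))
```

With times in nsec the result is in instructions per nsec; the factor of 1000
converts it to the insts/µsec unit used in the tables.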


It now remains to solve the system of Equations (5.1). In order to do
this, we must know the functions b, c, and d. The function d is simply given
by

    d(t_{pv}, t_{cv}) = 1/(t_{pv} + t_{cv})                              (5.3)

For the crossbar switch, we may use Strecker's model [Str 70] described
in Chapter 3. The function c is then given by Equation (3.1); this equation
is repeated here:

    c(n_{pb}, n_{mg}, t_{cg}, t_{pv})
        = (n_{mg} / (n_{pb} t_{cg})) (1 - (1 - P/n_{mg})^{n_{pb}})

where

    P + (t_{pv}/t_{cg}) (n_{mg}/n_{pb}) (1 - (1 - P/n_{mg})^{n_{pb}}) - 1 = 0   (5.4)

The value of this function can be computed for any given argument values
by solving the above equation.
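
The computation of c can be sketched as follows, assuming the form of the
Strecker expression in which c is the rate seen by one 'virtual' processor,
c = (n_mg / (n_pb t_cg)) (1 - (1 - P/n_mg)^n_pb), with P the root in [0, 1] of
P + (t_pv/t_cg)(n_mg/n_pb)(1 - (1 - P/n_mg)^n_pb) - 1 = 0. The function name,
the tolerance and the bisection loop are choices made for this illustration
and are not taken from the original FORTRAN program.

```python
def strecker_rate(n_pb, n_mg, t_cg, t_pv, tol=1e-12):
    """Rate c for one 'virtual' processor, per unit of the time scale used.

    n_pb: number of 'virtual' processors (processor buses)
    n_mg: number of global memory modules
    t_cg: global memory cycle time, t_pv: 'virtual' processing time
    """
    def busy(p):
        # Fraction of global modules that receive at least one request.
        return 1.0 - (1.0 - p / n_mg) ** n_pb

    def f(p):
        # Left-hand side of the equation for P; increasing in p.
        return p + (t_pv / t_cg) * (n_mg / n_pb) * busy(p) - 1.0

    lo, hi = 0.0, 1.0                # f(0) = -1 and f(1) >= 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    p = 0.5 * (lo + hi)
    return (n_mg / (n_pb * t_cg)) * busy(p)


# Times in nsec; multiply by 1000 for instructions per microsecond.
print(strecker_rate(7, 4, 1250.0, 3166.0) * 1000.0)
```

Because the left-hand side is monotonically increasing in P, bisection on
[0, 1] always brackets the single root.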

However, the function b is not known analytically. It can be computed
for any given argument values by using the model developed in Chapter 4
(with p = n_p, n_1 = n_{ml}, n_2 = 1, t_p = t_p, t_{c1} = t_{cl},
t_{c2} = t_{cv}, and t_b = t_b). The value of b is simply UER/(n_1 + n_2).

Keeping in view the consideration that not all the functions are known
analytically, it is possible to solve Equations (5.1) by the following
iterative method.

Choose an initial value for t_{cv}. We shall choose t_{cv} = t_{cg}.

Solve for t_{pv} from the equation

    b(n_p, n_{ml}, t_p, t_{cl}, t_b, t_{cv}) = d(t_{pv}, t_{cv})
                                             = 1/(t_{pv} + t_{cv})

i.e.

    t_{pv} = 1/b - t_{cv}.

Using this value of t_{pv}, solve for t_{cv} from the equation

    c(n_{pb}, n_{mg}, t_{cg}, t_{pv}) = d(t_{pv}, t_{cv})
                                      = 1/(t_{pv} + t_{cv})

i.e.

    t_{cv} = 1/c - t_{pv}.

One iteration is now completed. Use this value of t_{cv} to begin the
next iteration. Repeat this process until the value of t_{cv} calculated at
the end of an iteration is equal to the value of t_{cv} with which the
iteration was begun.

The convergence of this process will, of course, depend on the
functions b and c. These functions are not known analytically, so it is
difficult to say anything definite about the convergence. Since the proof of
the pudding is in its eating, we wrote a computer program to implement the
complete method described in this section. The function b was calculated
by solving numerically the set of simultaneous linear equations (4.1),
(4.2). The function c was computed by numerically solving Equation (5.4)
using the bisection method (it is known that the root P lies between 0 and 1).
In the more than four hundred cases on which we tried this method, it never
took more than 8 iterations, sometimes taking only 2 iterations. The
FORTRAN IV program took about 2 minutes compilation time and execution time
of the order of 2 seconds for each analysis on an IBM 7044. A few sample
outputs are shown in Table IV.
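
The iterative method can be sketched in Python as below. This is not the
FORTRAN IV program mentioned above. The function name, the tolerance and the
iteration limit are assumptions, and the two lambda functions are purely
hypothetical stand-ins for b and c (in the thesis, b comes from the bus model
of Chapter 4 and c from Strecker's expression); they serve only to make the
sketch runnable.

```python
def iterate_virtual_times(b_func, c_func, t_cg, tol=1e-6, max_iter=100):
    """Alternate between the bus and crossbar components until t_cv settles.

    b_func(t_cv): rate from the bus component for a given 'virtual' cycle time
    c_func(t_pv): rate from the crossbar component for a given 'virtual'
                  processing time
    Returns (t_pv, t_cv, number of iterations used).
    """
    t_cv = t_cg                                  # initial choice, as in the text
    for iteration in range(1, max_iter + 1):
        t_pv = 1.0 / b_func(t_cv) - t_cv         # from b = 1/(t_pv + t_cv)
        t_cv_new = 1.0 / c_func(t_pv) - t_pv     # from c = 1/(t_pv + t_cv)
        if abs(t_cv_new - t_cv) <= tol:
            return t_pv, t_cv_new, iteration
        t_cv = t_cv_new
    raise RuntimeError("iteration did not converge")


# Hypothetical stand-ins for b and c, chosen only for illustration.
b_demo = lambda t_cv: 1.0 / (2000.0 + 1.5 * t_cv)
c_demo = lambda t_pv: 1.0 / (t_pv + 500.0 + 1.0e6 / t_pv)

t_pv, t_cv, used = iterate_virtual_times(b_demo, c_demo, t_cg=1250.0)
print(t_pv, t_cv, used)
```

Passing b and c in as functions mirrors the remark that the method does not
depend on how the two component rates are obtained.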

TABLE IV

SAMPLE OUTPUTS OF ITERATIVE METHOD FOR ANALYSING
PLURIBUS SYSTEM

Parameter values              No. of     t_{cv}  b        t_{pv}  c        t_{cv}  Total UER
(all times in nsec.)          iteration  nsec    insts/   nsec    insts/   nsec    insts/
                                                 µsec             µsec             µsec

n_p = 2, n_{ml} = 2,          1          1250    0.2265   3166    0.2098   1602
n_{mg} = 4, n_{pb} = 7,       2          1602    0.2155   3038    0.2149   1614
t_p = 1425, t_{cl} = 850,     3          1614    0.2152   3033    0.2151   1615
t_{cg} = 1250, t_b = 200.     4          1615    0.2151   3033    0.2151   1615    4.5180

n_p = 2, n_{ml} = 2,          1          850     0.2724   2821    0.2391   1361
n_{mg} = 2, n_{pb} = 7,       2          1361    0.2505   2631    0.2479   1403
t_p = 1425, t_{cl} = 400,     3          1403    0.2488   2616    0.2486   1407
t_{cg} = 850, t_b = 200.      4          1407    0.2487   2615    0.2486   1407    5.2217

n_p = 2, n_{ml} = 2,          1          425     0.2539   3513    0.2523   450
n_{mg} = 4, n_{pb} = 5,       2          450     0.2530   3502    0.2530   450     3.7957
t_p = 1425, t_{cl} = 850,
t_{cg} = 425, t_b = 200.

n_p = 2, n_{ml} = 2,          1          1250    0.1965   3838    0.1878   1486
n_{mg} = 4, n_{pb} = 6,       2          1486    0.1912   3745    0.1910   1492
t_p = 1425, t_{cl} = 850,     3          1492    0.1910   3742    0.1911   1492    3.4388
t_{cg} = 1250, t_b = 400.

n_p = 2, n_{ml} = 2,          1          1250    0.3059   2019    0.2396   2154
n_{mg} = 4, n_{pb} = 10,      2          2154    0.2562   1749    0.2498   2254
t_p = 600, t_{cl} = 850,      3          2254    0.2514   1724    0.2508   2264
t_{cg} = 1250, t_b = 200.     4          2264    0.2509   1721    0.2509   2265
                              5          2265    0.2509   1721    0.2509   2265    7.5259

n_p = 2, n_{ml} = 2,          1          1000    0.2345   3264    0.1422   3769
n_{mg} = 1, n_{pb} = 7,       2          3769    0.1613   2431    0.1428   4573
t_p = 1425, t_{cl} = 850,     3          4573    0.1463   2263    0.1428   4740
t_{cg} = 1000, t_b = 200.     4          4740    0.1434   2231    0.1428   4771
                              5          4771    0.1429   2226    0.1428   4776
                              6          4776    0.1428   2225    0.1428   4777
                              7          4777    0.1428   2224    0.1428   4778
                              8          4778    0.1428   2224    0.1428   4778    2.9990

5.4 Analytical Results

Since the Pluribus model has 8 input parameters, each of which can vary
over a fairly wide range, it is impossible to obtain results covering the
whole parameter space or even a significant part of it. For this reason,
we limited ourselves to investigating the parameter space in the vicinity of
the values of the actual Pluribus system built by BBN. This system is
characterized by the following parameter values: n_p = 2, n_{ml} = 2,
n_{mg} = 4, n_{pb} = 7, t_p = 1425 nsec., t_{cl} = 850 nsec.,
t_{cg} = 1250 nsec., t_b = 200 nsec. Note that n_{mg} and t_{cg} are obtained
by applying the transformation mentioned at the beginning of Section 5.1
(with n_{mb} = 2, n_{bg} = 2, t_{mb} = 200 nsec., t_c = 850 nsec.). The value
for t_p is the processing time per unit instruction (see definition of unit
instruction in Section 2.5) and is obtained from the fact that the Lockheed
SUE processors used by BBN have a 3.7 µsec add or load time (we took this to
be a typical instruction) which is equivalent to two unit instructions
containing two accesses from memory (with access time 425 nsec.).

These results are presented in Figure 14. The parts of this figure
represent various cases of interest and are discussed below.

FIGURE 14: Analytical results for the Pluribus model, parts (a) to (i).
(UER in insts/µsec; all times in nsec.)

Figures 14(a) and (b): It is evident from these figures that no
significant change in the performance is effected by changing the processor
speed. For a constant processor speed, however, the performance increases
almost linearly with the number of processor buses. This shows that the
global memories do not constitute a bottleneck in the system. Note that
these figures are for a system with 4 global memories. When the number
of global memories is less, however, this statement does not hold, as we
shall see below.

Figures 14(c) and (d): An observation similar to the above holds for the
bus speed. These figures show that the effect of varying the bus speed
is negligible as compared to the effect of varying the number of buses.

Figure 14(e): This is a very interesting figure. It shows that increasing
the number of global memories from 1 to 4 improves the performance
significantly. Beyond that, however, it is useless to further increase this
number. On the other hand, increasing the global memory speed considerably
improves the performance irrespective of the number of global memories.

Figure 14(f): This figure shows how the performance varies with the local
memory speed for various global memory configurations. Clearly, 2 global
memories with cycle time 850 nsec. are better than 4 of 1250 nsec. The
local memory speed also has an impact on the performance.

Figures 14(g) and (h): These figures, together with the observations made
earlier, show that, although faster global and local memories are better,
the predominant effect on the performance is that of the number of
processor buses. Figure 14(f), however, gives a better commentary on the
interaction between these factors.

Figure 14(i): In this figure, the local and global memory speeds were
kept equal. It is seen that for slower memories, with 2 global memories,
the performance tends to saturate when the number of processor buses is
increased. For faster memories, there is hardly any difference between the
curves for 2 and 4 global memories. In this region, there is an almost
linear performance increase with the number of buses.

Since the system built at BBN has 4 global memories, we can interpret
these figures to mean that the system performance can be improved by speeding
up the memories, but a spectacular linear improvement can be achieved by
increasing the number of processor buses. However, increasing the number of
global memories and increasing the processor and bus speeds will not have any
significant impact on the performance. It also seems reasonable to assert
that, just as in crossbar switch systems, the performance of the Pluribus
can be increased without any law of diminishing returns so long as the
processor-memory bandwidths are kept matched.

TABLE V

COMPARISON OF ANALYTIC AND SIMULATION
RESULTS FOR PLURIBUS MODEL

n_{mg}  n_{pb}  t_p    t_{cl}  t_{cg}  t_b    Analytic  Simulation  Percentage
                nsec   nsec    nsec    nsec   UER       UER         Error
                                              insts/    insts/
                                              µsec      µsec

1       7       1425   850     425     200    5.1394    5.2247      1.66
1       7       1425   850     1000    200    2.9990    3.0273      0.94
2       4       1425   400     400     200    3.4923    3.5740      2.34
2       4       1425   1000    1000    200    2.6235    2.7160      3.53
2       5       1425   500     500     200    4.1347    4.2296      2.30
2       5       1425   1200    1200    200    2.9519    3.0602      3.67
2       6       1425   600     600     200    4.6846    4.7943      2.34
2       6       1425   1400    1400    200    3.1497    3.2194      2.21
2       7       1425   400     850     200    5.2217    5.2705      0.93
2       7       1425   500     500     200    5.7337    5.8423      1.89
2       7       1425   500     1250    200    4.2306    4.1876      1.02
2       7       1425   850     500     200    5.1852    5.3191      2.58
2       7       1425   850     1200    200    4.1354    4.1471      0.28
2       7       1425   1000    850     200    4.5636    4.6654      2.23
2       7       1425   1200    1200    200    3.9112    3.9723      1.56
2       7       1425   1200    1250    200    3.8408    3.8837      1.12
2       8       1425   700     700     200    5.7890    5.8889      1.73
2       8       1425   1600    1600    200    3.4393    3.3811      1.69
2       9       1425   800     800     200    5.9993    5.9862      0.22
2       9       1425   1800    1800    200    3.2232    3.1259      3.02
2       10      1425   400     400     200    8.5542    8.6715      1.37
2       10      1425   1000    1000    200    5.5554    5.3506      3.69
3       7       1425   850     700     200    5.0239    5.1593      2.70
3       7       1425   850     1400    200    4.2161    4.3191      2.44
4       4       600    850     1250    200    3.5119    3.7111      5.67
4       4       1400   850     1250    200    2.6792    2.7807      3.79
4       4       1425   500     500     200    3.3387    3.4220      2.50
4       4       1425   1200    1200    200    2.4617    2.5832      4.94
4       5       800    850     1250    200    4.0230    4.2322      5.20
4       5       1425   400     1250    200    3.6726    3.7750      2.79
4       5       1425   600     600     200    3.9717    4.0837      2.82
4       5       1425   850     425     200    3.7957    3.8960      2.64
4       5       1425   850     1000    200    3.4497    3.5675      3.41
4       5       1425   850     1250    100    3.5210    3.6529      3.75
4       5       1425   850     1250    250    3.1861    3.2998      3.57
4       5       1425   1000    1250    200    3.1783    3.3010      3.86
4       5       1425   1400    1400    200    2.8291    2.9790      5.30
4       5       1600   850     1250    200    3.1291    3.2344      3.37
4       6       1000   850     1250    200    4.4532    4.6649      4.75
4       6       1425   500     1250    200    4.2484    4.3916      3.37
4       6       1425   700     700     200    4.5342    4.6714      3.03
4       6       1425   850     500     200    4.4969    4.6144      2.61
4       6       1425   850     1200    200    3.9549    4.0847      3.28
4       6       1425   850     1250    150    4.0436    4.2021      3.92
4       6       1425   850     1250    300    3.6685    3.7934      3.40
4       6       1425   1200    1250    200    3.6110    3.7622      4.19
4       6       1425   1600    1600    200    3.1233    3.2740      4.83
4       6       1800   850     1250    200    3.5238    3.6292      2.99
4       7       1200   850     1250    200    4.8198    5.0014      3.77
4       7       1425   600     600     200    5.5335    5.6743      2.54
4       7       1425   600     850     200    5.2525    5.3717      2.27
4       7       1425   700     1250    200    4.6734    4.8314      3.38
4       7       1425   850     700     200    5.0582    5.2028      2.86
4       7       1425   850     850     200    4.9162    5.0812      3.36
4       7       1425   850     1250    200    4.5180    4.6748      3.47
4       7       1425   850     1250    350    4.1100    4.2252      2.80
4       7       1425   850     1400    200    4.3653    4.5071      3.25
4       7       1425   850     1600    200    4.1623    4.3073      3.48
4       7       1425   1400    850     200    4.2730    4.4556      4.27
4       7       1425   1400    1400    200    3.8893    4.0834      4.99
4       7       1425   1600    1250    200    3.8272    3.9749      3.86
4       7       2000   850     1250    200    3.8727    3.9734      2.60
4       8       600    850     1250    200    6.4194    6.5819      2.53
4       8       1400   850     1250    200    5.1364    5.3035      3.25
4       8       1425   600     1250    200    5.3841    5.5233      2.59
4       8       1425   800     800     200    5.7238    5.9070      3.20
4       8       1425   850     850     200    5.5904    5.7609      3.05
4       8       1425   850     1250    250    4.9489    5.0800      2.65
4       8       1425   850     1250    400    4.5145    4.6487      2.97
4       8       1425   850     1600    200    4.6600    4.8125      3.27
4       8       1425   1400    1250    200    4.5303    4.7251      4.30
4       8       1425   1800    1800    200    3.7738    3.9344      4.26
4       9       800    850     1250    200    6.6645    6.8668      3.04
4       9       1425   400     400     200    7.8388    8.0010      2.07
4       9       1425   700     1250    200    5.8394    5.9483      1.86
4       9       1425   850     425     200    6.8054    6.9755      2.50
4       9       1425   850     1000    200    6.0392    6.1993      2.65
4       9       1425   850     1250    100    5.9905    6.1348      2.41
4       9       1425   850     1250    300    5.3408    5.4739      2.49
4       9       1425   1000    1000    200    5.8283    6.0256      3.39
4       9       1425   1600    1250    200    4.8460    5.0061      3.29
4       9       1600   850     1250    200    5.4129    5.5169      1.92
4       10      1000   850     1250    200    6.8713    7.0482      2.57
4       10      1425   500     500     200    8.2539    8.4320      2.16
4       10      1425   800     1250    200    6.2566    6.3687      1.79
4       10      1425   850     500     200    7.4499    7.6367      2.51
4       10      1425   850     1200    200    6.2886    6.4082      1.90
4       10      1425   850     1250    150    6.3647    6.4345      1.88
4       10      1425   850     1250    350    5.6975    5.8412      2.52
4       10      1425   1200    1200    200    5.8544    5.9969      2.43
4       10      1425   1800    1250    200    5.1311    5.3574      4.41
4       10      1800   850     1250    200    5.6562    5.7656      1.93
4       11      1200   850     1250    200    7.0501    7.0362      0.20
4       11      2000   850     1250    200    5.8728    5.9830      1.89
4       12      600    850     1250    200    8.3477    8.2118      1.65
4       12      1400   850     1250    200    7.2054    7.2179      0.17
5       7       1425   850     425     200    5.3104    5.4478      2.59
5       7       1425   850     1000    200    4.8073    4.9641      3.26
6       7       1425   850     500     200    5.2536    5.3944      2.68
6       7       1425   850     1200    200    4.6571    4.8268      3.64

5.5 Simulation Results

Simulation studies were conducted to validate the model for the Pluribus
system presented in this chapter. The simulation program was written in
FORTRAN IV and run on an IBM 7044. The program made only the three
assumptions listed in Section 5.2; it took constant processing times as well
as constant memory cycle times. The instruction execution rate was averaged
over a total of 5000 memory cycles. This amounted to the processing of
anywhere between 5000 and 32000 instructions (approximately) by the
multiprocessor system, depending on the system parameters.

The simulation results together with the analytic results for some
representative cases are shown in Table V. It can be seen that the errors
are all below 6 percent. This verifies the accuracy and usefulness of the
analytic Pluribus model. It also shows that the approximations made in
replacing parts of the system by 'virtual' memories and 'virtual' processors
(see Section 5.3) are reasonable and do not have a significant impact on the
accuracy of the model.
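
The percentage errors quoted for Table V can be reproduced with a short
Python sketch. It assumes that the error is taken relative to the analytic
value, which is what the tabulated figures suggest; the function name is
chosen for the illustration, and the sample values are two analytic and
simulation UER pairs from Table V.

```python
def percentage_error(analytic, simulated):
    """Absolute difference between the two UER values, as a percentage
    of the analytic value (assumed convention)."""
    return abs(simulated - analytic) / analytic * 100.0


# (analytic UER, simulation UER) pairs in insts/microsec.
for analytic, simulated in [(5.1394, 5.2247), (4.2306, 4.1876)]:
    print(round(percentage_error(analytic, simulated), 2))
```

Taking the absolute value matters because the simulation result is
occasionally below the analytic one.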



CHAPTER 6 


CONCLUSIONS

We shall now summarize the main results obtained in this thesis and
suggest directions that future work in this area may take.

In Chapter 3, we discussed a model for crossbar switch systems which
takes into account the local referencing property that characterizes most
computer programs. It was found that the performance of multiprocessor
systems with the local reference model is worse than that predicted by the
traditional uniform reference model. We also derived new expressions for
the uniform reference model and compared them with expressions available
in the literature. Our expressions were more accurate than the existing
expressions in most cases.

In Chapter 4, we presented a Markov chain model for multiprocessor
systems using a time-shared bus. For reasons mentioned in Chapter 2, a
single time-shared bus is not generally used in large multiprocessor systems.
However, it is useful to have such models, not only because they help us in
analysing other systems such as the Pluribus, but also because they aid in
the understanding of evaluation techniques for multiprocessors.

In Chapter 5, we presented an analytic model for the Pluribus system.
This model decomposes the Pluribus into its components comprising the
crossbar switch and the processor buses. The performance of the total system
is calculated in an iterative way from the performance of the two different
components. In presenting our results, we used the model for the
time-shared bus developed earlier and an expression derived by Strecker for
the crossbar switch. We would now like to point out that the iterative method
used in the model is independent of the models or expressions that may be
used for calculating the performances of the component systems. If, as
seems plausible, better methods are available in future to analyse the
time-shared bus and crossbar switch systems, these may be profitably used
in this Pluribus model to give better prediction of the system performance.
Further, if analytic expressions are found for both the sub-systems, it
may even become possible to express the Pluribus performance analytically
by solving Equation (5.1).

However, the absence of an analytical expression for the Pluribus
performance in no way detracts from the usefulness of the model. The program
written to implement the analysis method is quite simple and allows a system
designer to quickly evaluate a large design space in an efficient way. This
should help in evaluating the performance of systems based on the Pluribus
architecture, and in pointing out where the bottlenecks lie and what methods
to use for improving the system performance.

It is also our hope that our efforts will spark an interest in devising
better analytic models for multiprocessor systems. The tools available in
this area are still pitifully few and a lot of work is needed to catch up
with the rapid growth in multiprocessor technology. Work on performance
evaluation has so far been limited only to crossbar switch systems. We
have now given analytic models for time-shared bus and Pluribus systems.



These models themselves stand in need of improvement. In addition,
it is necessary to work on models for multiport memory/multibus systems
and for other unconventional systems which are being designed these days.
With the advent of microprocessors on the computer scene, it is now
becoming feasible to construct multiprocessor systems containing a large
number of processors (typically hundreds of microprocessors) which represents
an increase of an order of magnitude in the number of functional units
connected to the system. Clearly, the interconnection mechanism used in
these systems is of crucial importance and work is in progress to devise
new and better interconnection structures [Bar 75, Swa 76]. To keep pace
with these developments, the tools of analytic modelling must be improved
so that the system designer can easily evaluate and compare the various
choices he faces. We hope that this thesis has been one step forward towards
these goals.
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APPENDIX 

TRANSITION MATRIX FOR THE BUS MODEL EXAMPLE

For the bus model discussed in Section 4.4, with p = 2, n_1 = 2, n_2 = 1, the transition matrix is:
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